Building Confidential and Efficient Query Services in the Cloud with RASP Data Perturbation

Huiqi Xu, Shumin Guo, Keke Chen
Ohio Center of Excellence in Knowledge Enabled Computing
Wright State University, Dayton, OH 45435

Abstract — With the wide deployment of public cloud computing infrastructures, using clouds to host data query services has 
become an appealing solution because of its advantages in scalability and cost saving. However, some data might be so sensitive that the
data owner does not want to move them to the cloud unless data confidentiality and query privacy are guaranteed. On the other
hand, a secured query service should still provide efficient query processing and significantly reduce the in-house workload to
fully realize the benefits of cloud computing. We propose the RASP data perturbation method to provide secure and efficient
range query and kNN query services for protected data in the cloud. The RASP data perturbation method combines order
preserving encryption, dimensionality expansion, random noise injection, and random projection to provide strong resilience to
attacks on the perturbed data and queries. It also preserves multidimensional ranges, which allows existing indexing techniques
to be applied to speed up range query processing. The kNN-R algorithm is designed to work with the RASP range query algorithm
to process kNN queries. We have carefully analyzed the attacks on data and queries under a precisely defined threat model
and realistic security assumptions. Extensive experiments have been conducted to show the advantages of this approach in
efficiency and security.

Index Terms — query services in the cloud, privacy, range query, kNN query 



1 Introduction 

Hosting data-intensive query services in the cloud is 
increasingly popular because of the unique advan- 
tages in scalability and cost-saving. With the cloud 
infrastructures, the service owners can conveniently 
scale up or down the service and only pay for the 
hours of using the servers. This is an attractive feature 
because the workloads of query services are highly 
dynamic, and it will be expensive and inefficient to 
serve such dynamic workloads with in-house infrastructures [2]. However, because the service owners
lose control over the data in the cloud, data
confidentiality and query privacy have become the
major concerns. Adversaries, such as curious service
providers, can possibly make a copy of the database
or eavesdrop on users' queries, which is difficult to
detect and prevent in cloud infrastructures.

While new approaches are needed to preserve data confidentiality and query privacy, the efficiency of query services and the benefits of using the clouds should also be preserved. It is not meaningful to provide slow query services as the price of security and privacy assurance. It is also not practical for the data owner to use a significant amount of in-house resources, because the purpose of using cloud resources is to reduce the need for maintaining scalable in-house infrastructures. Therefore, there is an intricate relationship among data confidentiality, query privacy, the quality of service, and the economics of using the cloud.

We summarize these requirements for constructing 
a practical query service in the cloud as the CPEL 
criteria: data Confidentiality, query Privacy, Efficient 
query processing, and Low in-house processing cost. 
Satisfying these requirements will dramatically in- 
crease the complexity of constructing query services 
in the cloud. Some related approaches have been developed to address some aspects of the problem. However, they do not satisfactorily address all of these aspects. For example, the crypto-index [12] and Order Preserving Encryption (OPE) [1] are vulnerable to attacks. The enhanced crypto-index approach [14] puts a heavy burden on the in-house infrastructure to improve security and privacy. The New Casper approach [24] uses cloaking boxes to protect data objects and queries, which affects the efficiency of query processing and the in-house workload. We summarize the weaknesses of the existing approaches in Section 7.

We propose the RAndom Space Perturbation (RASP) approach to constructing practical range query and k-nearest-neighbor (kNN) query services in the cloud. The proposed approach addresses all four aspects of the CPEL criteria and aims to achieve a good balance among them. The basic idea is to randomly transform the multidimensional datasets with a combination of order preserving encryption, dimensionality expansion, random noise injection, and random projection, so that the utility for processing range queries
is preserved. The RASP perturbation is designed in 
such a way that the queried ranges are securely 
transformed into polyhedra in the RASP-perturbed 
data space, which can be efficiently processed with the 
support of indexing structures in the perturbed space. 
The RASP kNN query service (kNN-R) uses the RASP 
range query service to process kNN queries. The key 
components in the RASP framework include (1) the 
definition and properties of RASP perturbation; (2) the 
construction of the privacy-preserving range query 
services; (3) the construction of privacy-preserving 
kNN query services; and (4) an analysis of the attacks 
on the RASP-protected data and queries. 

In summary, the proposed approach has a number 
of unique contributions. 

• The RASP perturbation is a unique combination of OPE, dimensionality expansion, random noise injection, and random projection, which provides a strong confidentiality guarantee.

• The RASP approach preserves the topology of multidimensional ranges under secure transformation, which allows indexing and efficient query processing.

• The proposed service constructions are able to 
minimize the in-house processing workload be- 
cause of the low perturbation cost and high pre- 
cision query results. This is an important feature 
enabling practical cloud-based solutions. 

We have carefully evaluated our approach with syn- 
thetic and real datasets. The results show its unique 
advantages on all aspects of the CPEL criteria. 

The paper is organized as follows. In Section 3, we define the RASP perturbation method, describe its major properties, and analyze the attacks on the RASP perturbed data. We also introduce the framework for constructing the query services with the RASP perturbation. In Section 4, we describe the algorithm for transforming queries and processing range queries. In Section 5, the range query service is extended to handle kNN queries. When describing these two services, we also analyze the attacks on query privacy. Finally, we present some related approaches in Section 7 and analyze their weaknesses in terms of the CPEL criteria.

2 Query Services in the Cloud 

This section presents the notations, the system architecture, and the threat model for the RASP approach, and prepares for the security analysis in later sections. The design of the system architecture keeps the cloud economics in mind so that most data storage and computing tasks will be done in the cloud. The threat model makes realistic security assumptions and clearly defines the practical threats that the RASP approach will address.

2.1 Definitions and Notations 

First, we establish the notations. For simplicity, we consider only single database tables, which can be the result of denormalization from multiple relations. A database table consists of n records and d searchable attributes. We also refer to an attribute as a dimension or a column; the three terms are used interchangeably in the paper. Each record can be represented as a vector in the multidimensional space, denoted by lowercase letters. If a record x is d-dimensional, we say $x \in \mathbb{R}^d$, where $\mathbb{R}^d$ is the d-dimensional vector space. A table is also treated as a d × n matrix, with records represented as column vectors. We use capital letters to represent a table, and indexed capital letters, e.g., $X_i$, to represent columns. Each column is defined on a numerical domain. Categorical data columns are allowed in range queries; they are converted to numerical domains as described in Section 3.

Range query is an important type of query for many data analytic tasks, from simple aggregation to more sophisticated machine learning tasks. Let T be a table, $X_i$, $X_j$, and $X_k$ be real-valued attributes in T, and a and b be some constants. Take the counting query for example. A typical range query looks like

select count(*) from T
where $X_i \in [a_i, b_i]$ and $X_j \in (a_j, b_j)$ and $X_k = a_k$

which calculates the number of records in the range defined by the conditions on $X_i$, $X_j$, and $X_k$. Range queries may be applied to an arbitrary number of attributes, with the conditions on these attributes combined by the logical operators "and"/"or". We call each part of the query condition that involves only one attribute a simple condition. A simple condition like $X_i \in [a_i, b_i]$ can be described with two half-space conditions $X_i \le b_i$ and $-X_i \le -a_i$. Without loss of generality, we will discuss how to process half-space conditions like $X_i \le b_i$ in this paper. A slight modification will extend the discussed algorithms to handle other conditions like $X_i < b_i$ and $X_i = b_i$.
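
The rewriting of a simple range condition into two half-space conditions can be illustrated with a short Python sketch. It is only an illustration on plaintext data, not part of the RASP service; the helper names `range_to_half_spaces` and `satisfies` and the toy table are assumptions introduced here.

```python
from typing import List, Tuple

# A half-space condition sign * x[dim] <= bound, stored as (dim, sign, bound).
HalfSpace = Tuple[int, float, float]


def range_to_half_spaces(dim: int, a: float, b: float) -> List[HalfSpace]:
    """Rewrite X_dim in [a, b] as X_dim <= b and -X_dim <= -a."""
    return [(dim, 1.0, b), (dim, -1.0, -a)]


def satisfies(record: List[float], conditions: List[HalfSpace]) -> bool:
    """True if the record satisfies every half-space condition."""
    return all(sign * record[dim] <= bound for dim, sign, bound in conditions)


# Hypothetical toy table: each inner list is one record (x_0, x_1).
table = [[1.0, 5.0], [2.0, 7.0], [3.0, 9.0], [4.0, 2.0]]

# select count(*) from T where X_0 in [2, 4] and X_1 in [1, 8]
conditions = range_to_half_spaces(0, 2.0, 4.0) + range_to_half_spaces(1, 1.0, 8.0)
count = sum(satisfies(r, conditions) for r in table)
```

Representing every condition in the same "sign times value is at most bound" shape is what allows a single processing routine to handle all simple conditions.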

A kNN query finds the k records closest to the query point, where the Euclidean distance is often used to measure proximity. It is frequently used in location-based services for searching the objects close to a query point, and also in machine learning algorithms such as hierarchical clustering and the kNN classifier. A kNN query consists of the query point and the number of nearest neighbors, k.
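
For reference, the semantics of a kNN query on unprotected data can be sketched as a brute-force search. This is a plaintext baseline for illustration only, not the kNN-R algorithm; the function name `knn` and the sample points are assumptions.

```python
import heapq
import math
from typing import List, Sequence


def knn(records: List[Sequence[float]], query: Sequence[float], k: int) -> List[Sequence[float]]:
    """Return the k records closest to the query point (Euclidean distance)."""
    return heapq.nsmallest(k, records, key=lambda r: math.dist(r, query))


points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
nearest = knn(points, (1.2, 0.9), k=2)
```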

2.2 System Architecture 

We assume that a cloud computing infrastructure, such as Amazon EC2, is used to host the query services and large datasets. The purpose of this architecture is to extend the proprietary database servers to the public cloud, or use a hybrid private-public cloud to achieve scalability and reduce costs while maintaining confidentiality.

Fig. 1. The system architecture for RASP-based query services.

Each record x in the outsourced database contains two parts: the RASP-processed attributes D' = F(D, K) and the encrypted original records, Z = E(D, K'), where K and K' are the keys for perturbation and encryption, respectively. The RASP-perturbed data D' are used for indexing and query processing. Figure 1 shows the system architecture for both the RASP-based range query service and the kNN service.

There are two clearly separated groups: the trusted 
parties and the untrusted parties. The trusted parties 
include the data/service owner, the in-house proxy 
server, and the authorized users who can only submit 
queries. The data owner exports the perturbed data to 
the cloud. Meanwhile, the authorized users can sub- 
mit range queries or kNN queries to learn statistics or 
find some records. The untrusted parties include the 
curious cloud provider who hosts the query services 
and the protected database. The RASP-perturbed data 
will be used to build indices to support query process- 
ing. 

There are a number of basic procedures in this framework: (1) F(D) is the RASP perturbation that transforms the original data D to the perturbed data D'; (2) Q(q) transforms the original query q to the protected form q' that can be processed on the perturbed data; (3) H(q', D') is the query processing algorithm that returns the result R'. When statistics such as SUM or AVG of a specific dimension are needed, RASP can work with partially homomorphic encryption such as Paillier encryption [25] to compute these statistics on the encrypted data, which are then recovered with the procedure G(R').

2.3 Threat Model 

Assumptions. Our security analysis is built on the im- 
portant features of the architecture. Under this setting, 
we believe the following assumptions are appropriate. 
• Only the authorized users can query the proprietary database. Authorized users are not malicious and will not intentionally breach the confidentiality. We consider insider attacks orthogonal to our research; thus, we exclude the situation in which authorized users collude with the untrusted cloud providers to leak additional information.

• The client-side system and the communication 
channels are properly secured and no protected 
data records and queries can be leaked. 

• Adversaries can see the perturbed database, the 
transformed queries, the whole query processing 
procedure, the access patterns, and understand 
the same query returns the same set of results, 
but nothing else. 

• Adversaries can possibly have the global infor- 
mation of the database, such as the applications 
of the database, the attribute domains, and pos- 
sibly the attribute distributions, via other pub- 
lished sources (e.g., the distribution of sales, or 
patient diseases, in public reports). 

These assumptions can be maintained and rein- 
forced by applying appropriate security policies. Note 
that this model is equivalent to the eavesdropping 
model equipped with the plaintext distributional 
knowledge in the cryptographic setting. 

Protected Assets. Data confidentiality and query privacy should be protected in the RASP approach. While the integrity of query services is also an important issue, it is orthogonal to our study. Existing integrity checking and preventing techniques [34], [30], [19] can be integrated into our framework. Thus, the integrity problem is excluded from the paper, and we assume that the curious cloud provider is interested in the data and queries, but will honestly follow the protocol to provide the infrastructure service.

Attacker Modeling. The goal of the attack is to recover (or estimate) the original data from the perturbed data, or to identify the exact queries (e.g., location queries) to breach users' privacy. According to the level of prior knowledge the attacker may have, we categorize the attacks into two categories.

• Level 1: The attacker knows only the perturbed data and transformed queries, without any other prior knowledge. This corresponds to the ciphertext-only attack in the cryptographic setting.

• Level 2: The attacker also knows the original 
data distributions, including individual attribute 
distributions and the joint distribution (e.g., the 
covariance matrix) between attributes. In prac- 
tice, for some applications, whose statistics are 
interesting to the public domain, the dimen- 
sional distributions might have been published 
via other sources. 

These levels of knowledge are appropriate according to the assumptions we hold. We will analyze the security based on this threat model.

Security Definition. Unlike with traditional encryption schemes, attackers can also be satisfied with a good estimation. Therefore, we will investigate two levels of security definitions: (1) it is computationally intractable for the attacker to recover the exact original data based on the perturbed data; (2) the attacker cannot effectively estimate the original data. The effectiveness measure is defined with the NR_MSE measure in Section 3.3.

3 RASP: Random Space Perturbation 

In this section, we present the basic definition of the RAndom Space Perturbation (RASP) method and its properties. We will also discuss the attacks on RASP perturbed data, based on the threat model given in Section 2.

3.1 Definition of RASP 

RASP is one type of multiplicative perturbation, with a novel combination of OPE, dimension expansion, random noise injection, and random projection. We consider multidimensional data that are numeric and lie in a multidimensional vector space¹. The database has d searchable dimensions and n records, which makes a d × n matrix X. The searchable dimensions can be used in queries and thus should be indexed. Let x represent a d-dimensional record, $x \in \mathbb{R}^d$. Note that in the d-dimensional vector space $\mathbb{R}^d$, the range query conditions are represented as half-space functions, and a range query is translated to finding the point set in the corresponding polyhedron area described by the half spaces [4].

The RASP perturbation involves three steps. Its security is based on the existence of a random invertible real-value matrix generator and a random real-value generator. For each d-dimensional input vector x,

1) An order preserving encryption (OPE) scheme [1], $E_{ope}$ with keys $K_{ope}$, is applied to each dimension of x: $E_{ope}(x, K_{ope}) \in \mathbb{R}^d$, to change the dimensional distributions to normal distributions with each dimension's value order still preserved.

2) The vector is then extended to d + 2 dimensions as $G(x) = ((E_{ope}(x))^T, 1, v)^T$, where the (d + 1)-th dimension is always 1 and the (d + 2)-th dimension, v, is drawn from a random real number generator RNG that generates random values from a tailored normal distribution. We will discuss the design of RNG and OPE later.

3) The (d + 2)-dimensional vector is finally transformed to

$F(x, K = \{A, K_{ope}, RNG\}) = A((E_{ope}(x))^T, 1, v)^T$, (1)

where A is a (d + 2) × (d + 2) randomly generated invertible matrix with $a_{ij} \in \mathbb{R}$ such that there are at least two non-zero values in each row of A and the last column of A is also non-zero².

1. For categorical attributes, we use the following simple mapping because it will not break the query semantics. For a categorical attribute $X_i$, the values $\{c_1, \ldots, c_m\}$ in the domain are mapped to $\{1, \ldots, m\}$. A query condition on categorical values, say $X_i = c_j$, is then converted to $j - \delta < X_i < j + \delta$, where $\delta \in (0, 1)$.
$K_{ope}$ and A are shared by all vectors in the database, but v is randomly generated for each individual vector. Since the RASP-perturbed data records are only used for indexing and helping query processing, there is no need to recover the perturbed data. As we mentioned, in the case that original records are needed, the encrypted records associated with the RASP-perturbed records will be returned. We give the detailed algorithm in the Appendix.
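
The three steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the algorithm given in the Appendix: `toy_ope` is a hypothetical rank-to-normal-quantile mapping standing in for the OPE scheme of [1], and `determinant`, `random_invertible`, and `rasp_perturb` are illustrative helper names. Gaussian matrix entries are non-zero with probability one, so the non-zero conditions on A are not checked explicitly here.

```python
import random
from statistics import NormalDist
from typing import List

Matrix = List[List[float]]


def toy_ope(column: List[float]) -> List[float]:
    """Stand-in for E_ope: map values to standard normal quantiles by rank.

    This is NOT the OPE scheme of [1]; it only preserves order and yields
    roughly normally distributed outputs.
    """
    ranks = {v: i for i, v in enumerate(sorted(set(column)))}
    m = len(ranks)
    return [NormalDist().inv_cdf((ranks[v] + 0.5) / m) for v in column]


def determinant(a: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n, det = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return det


def random_invertible(size: int, rng: random.Random) -> Matrix:
    """Draw Gaussian entries until the matrix is invertible."""
    while True:
        a = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
        if abs(determinant(a)) > 1e-6:
            return a


def rasp_perturb(records: List[List[float]], a: Matrix, rng: random.Random,
                 beta: float = 4.0) -> List[List[float]]:
    """Apply F(x) = A (E_ope(x)^T, 1, v)^T to each record."""
    d = len(records[0])
    cols = [toy_ope([r[j] for r in records]) for j in range(d)]
    out = []
    for i in range(len(records)):
        v = rng.gauss(0.0, 1.0)
        while abs(v) > beta:  # keep the noise inside [-beta, beta]
            v = rng.gauss(0.0, 1.0)
        z = [cols[j][i] for j in range(d)] + [1.0, v]
        out.append([sum(a[r][c] * z[c] for c in range(d + 2)) for r in range(d + 2)])
    return out


rng = random.Random(0)
data = [[25.0, 50000.0], [40.0, 82000.0], [31.0, 61000.0]]  # hypothetical records
A = random_invertible(len(data[0]) + 2, rng)
perturbed = rasp_perturb(data, A, rng)
```

Each perturbed record has d + 2 components, and a fresh noise value v is drawn per record while A is shared by all records.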

Design of OPE and RNG. We use the OPE scheme to convert all dimensions of the original data to the standard normal distribution $\mathcal{N}(0, 1)$ in the limited domain $[-\beta, \beta]$. $\beta$ can be selected as a value $\ge 4$, as the range [-4, 4] covers more than 99% of the population. This can be done with an algorithm such as the one described in [1]. The use of OPE allows queries to be correctly transformed and processed. Similarly, we draw random noises v from $\mathcal{N}(0, 1)$ in the limited domain $[-\beta, \beta]$. Such a design makes the extended noise dimension indistinguishable from the data dimensions in terms of distribution.
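
The coverage claim for the limited domain can be checked numerically with the standard normal CDF. The sketch below uses Python's `statistics.NormalDist`; the function name `coverage` is an assumption introduced for illustration.

```python
from statistics import NormalDist


def coverage(beta: float) -> float:
    """Probability mass of N(0, 1) that falls inside [-beta, beta]."""
    std = NormalDist(0.0, 1.0)
    return std.cdf(beta) - std.cdf(-beta)


mass_at_4 = coverage(4.0)   # mass kept when beta = 4
mass_at_2 = coverage(2.0)   # mass inside two standard deviations
```

With beta = 4 the truncation discards only a tiny fraction of the population, so restricting the domain barely changes the shape of the distribution.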

The design of such an extended data vector $(E_{ope}(x)^T, 1, v)^T$ is to enhance the data and query confidentiality. The use of OPE is to transform large-scale or infinite domains to normal distributions, which addresses the distributional attack. The (d + 1)-th homogeneous dimension is for hiding the query content. The (d + 2)-th dimension injects random noise into the perturbed data and also protects the transformed queries from attacks. The rationale behind these design choices will be discussed in later sections.

3.2 Properties of RASP 

RASP has several important features. First, RASP does not preserve the order of dimensional values because of the matrix multiplication component, which distinguishes it from order preserving encryption (OPE) schemes, and thus it does not suffer from the distribution-based attack (details in Section 7). An OPE scheme maps a set of single-dimensional values to another, while keeping the value order unchanged. Since the RASP perturbation can be treated as a combined transformation $F(G(E_{ope}(x)))$, it is sufficient to show that F(y) = Ay does not preserve the order of dimensional values, where $y \in \mathbb{R}^{d+2}$ and $A \in \mathbb{R}^{(d+2) \times (d+2)}$. The proof is straightforward, as shown in the Appendix.
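
A small numeric counterexample makes the first property concrete. The sketch below uses a hypothetical 2 × 2 invertible matrix instead of a full (d + 2) × (d + 2) key matrix, purely for readability; `matvec` is an illustrative helper.

```python
from typing import List


def matvec(a: List[List[float]], y: List[float]) -> List[float]:
    """Matrix-vector product A y."""
    return [sum(a_ij * y_j for a_ij, y_j in zip(row, y)) for row in a]


# Hypothetical invertible matrix (determinant 1*1 - (-2)*1 = 3).
A = [[1.0, -2.0],
     [1.0, 1.0]]

y1 = [0.0, 3.0]
y2 = [1.0, 5.0]   # y2 is larger than y1 in both dimensions

p1, p2 = matvec(A, y1), matvec(A, y2)
# p1 = [-6, 3], p2 = [-9, 6]: the order in the first dimension is reversed.
```

Because each output dimension mixes several input dimensions, the per-dimension order of the inputs does not carry over to the outputs.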

Second, RASP does not preserve the distances between records, which protects the perturbed data from distance-based attacks [8]. Because none of the transformations in RASP ($E_{ope}$, G, and F) preserves distances, the RASP perturbation does not preserve distances. Similarly, RASP does not preserve other more sophisticated structures such as the covariance matrix and principal components [18]. Therefore, the PCA-based attacks such as [16], [20] do not work either.

2. Currently, we use a random invertible matrix generator that draws matrix elements at random from the standard normal distribution and checks the matrix invertibility and the non-zero conditions.

Third, the original range queries can be transformed 
to the RASP perturbed data space, which is the ba- 
sis of our query processing strategy. A range query 
describes a hyper-cubic area (with possibly open 
bounds) in the multidimensional space. In Section 4,
we will show that a hyper-cubic area in the original 
space is transformed to a polyhedron with the RASP 
perturbation. Thus, we can search the points in the 
polyhedron to get the query results. 

3.3 Data Confidentiality Analysis 

As the threat model describes, attackers might be 
interested in finding the exact original data records 
or estimating them based on the perturbed data. 
For estimation attack, if the estimation is sufficiently 
accurate (above certain accuracy threshold), we say 
the perturbation is not secure. Below, we define the 
measure for evaluating the effectiveness of estimation 
attacks. 

3.3.1 Evaluating Effectiveness of Estimation Attacks

Because attackers may not need to exactly recover the original values, an accurate estimation will be sufficient. A measure is needed to define the "accuracy" or "uncertainty" as we mentioned. We use the commonly used mean-squared-error (MSE) to evaluate the effectiveness of an attack. To be semantically consistent, the j-th dimension can be treated as sample values drawn from a random variable $X_j$. Let $x_{ij}$ be the value of the i-th original record in the j-th dimension and $\hat{x}_{ij}$ be the estimated value. The MSE for the j-th dimension can be defined as

$MSE(X_j, \hat{X}_j) = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \hat{x}_{ij})^2$,

which is equivalent to the variance $var(X_j - \hat{X}_j)$ when the estimation error has zero mean. The square root of MSE (RMSE) represents the uncertainty of the estimation: for an estimated value $\hat{x}$, the original value x could be in the range $(\hat{x} - RMSE, \hat{x} + RMSE)$. Thus, the length of the range, 2 × RMSE, also represents the accuracy of the estimation.

However, this length is subject to the length of the domain. Thus, we use the normalized square root of MSE (NR_MSE),

$NR\_MSE(X_j) = 2\sqrt{MSE(X_j, \hat{X}_j)} / \text{domain length}$, (2)

instead, which is intuitively the ratio between the uncertain range and the whole domain.
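
Eq. (2) translates directly into a few lines of Python. The function name `nr_mse` and the sample values below are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def nr_mse(original: Sequence[float], estimated: Sequence[float],
           domain_length: float) -> float:
    """NR_MSE = 2 * sqrt(MSE) / domain_length, following Eq. (2)."""
    n = len(original)
    mse = sum((x - e) ** 2 for x, e in zip(original, estimated)) / n
    return 2.0 * math.sqrt(mse) / domain_length


original = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
score = nr_mse(original, estimated, domain_length=100.0)
```

Here the squared errors are 4, 4, 9, and 9, so MSE = 6.5 and the uncertain range 2 × RMSE is about 5.1% of the domain length.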



To compare MSE for multiple columns, we also need to normalize the two series $\{x_{ij}\}$ and $\{\hat{x}_{ij}\}$ to eliminate the difference in domain scales. The normalization procedure [11] is described as follows. Assume the mean and variance of the series $\{x_{ij}\}$ are $\mu_j$ and $\sigma_j^2$, respectively. The series is transformed by $(x_{ij} - \mu_j)/\sigma_j$. A similar procedure is also applied to the series $\{\hat{x}_{ij}\}$. For the normalized domains, the range [-2, 2] almost covers the whole population³. Therefore, for normalized series, NR_MSE is simply RMSE/2.

For an attack that can only result in a low-accuracy estimate (e.g., NR_MSE > 20%, meaning the uncertainty is more than 20% of the domain length), we say the RASP-perturbed dataset is resilient to that attack. Intuitively, an NR_MSE higher than 100% is not very meaningful. Thus, we set the absolute upper bound to be 100%. We will discuss the specific upper bounds according to the level of prior knowledge.

3.3.2 Prior-Knowledge Based Analysis 

Below, we analyze the security under the two levels 
of knowledge the attacker may have, according to the 
two levels of security definitions: exact match and 
statistical estimation. 

Naive Estimation. We assume each value in the vector or matrix is encoded with n bits. Let the perturbed vector p be drawn from a random variable V, and the original vector x be drawn from a random variable X. We show that it is computationally intractable for naive estimation to identify the exact original data from the perturbed data, if we use a random invertible real matrix generator and a random real value generator. The goal is to count the valid candidate X datasets for a known perturbed dataset P. Below we discuss a simplified version that contains no OPE component; the OPE version has at least the same level of security.

3. For a normal distribution $N(\mu, \sigma^2)$, the range $(\mu - 2\sigma, \mu + 2\sigma)$ covers about 95% of the population. We use this length $4\sigma$ to approximately represent the majority of the population for all other distributions, as the normal distribution is a good approximation for many applications.

Proposition 1: For a known perturbed dataset P, there exist $O(2^{(d+1)(d+2)n})$ candidate X datasets in the original space.

Proof: For a given perturbation P = AZ, where Z is X with the two extended dimensions, we use $B_{d+1}$ to represent the (d + 1)-th row of $A^{-1}$. Thus, $B_{d+1}P = [1, \ldots, 1]$, i.e., the appended (d + 1)-th row of Z. Keeping $B_{d+1}$ unchanged, we randomly generate the other rows of B for a candidate $\hat{B}$. The result $\hat{Z} = \hat{B}P$ is a valid estimate of Z if $\hat{B}$ is invertible. Thus, the number of candidate X is the number of invertible $\hat{B}$.

The total number of $\hat{B}$, including non-invertible ones, is $2^{(d+1)(d+2)n}$. Based on the theory of invertible random matrices [28], the probability of generating a non-invertible random matrix is less than $\exp(-c(d+2))$ for some constant c. Thus, there are about $(1 - \exp(-c(d+2)))\,2^{(d+1)(d+2)n}$ invertible $\hat{B}$. Correspondingly, there is the same number of candidate X. □

Thus, finding the exact X has a negligible probability in terms of the number of bits, n.
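
To get a feeling for the size of this candidate space, the exponent in Proposition 1 can be evaluated for concrete parameters. The values d = 2 and n = 32 bits below are assumptions picked only for illustration, and `candidate_bits` is a hypothetical helper name.

```python
def candidate_bits(d: int, n_bits: int) -> int:
    """Exponent (d+1)(d+2)n of the candidate count in Proposition 1."""
    return (d + 1) * (d + 2) * n_bits


exponent = candidate_bits(d=2, n_bits=32)   # (3)(4)(32) = 384
candidates = 2 ** exponent                  # upper bound on the number of candidate matrices
```

Even for a two-dimensional table with 32-bit values, the attacker faces on the order of 2^384 candidates.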

As the candidates have an equal probability over the whole domain, according to the definition of NR_MSE, the uncertain range is the same as the whole domain, resulting in NR_MSE = 100%.

Distribution-based Estimation. With the known distributional information, the attacker can do more in estimating the original data. The most relevant known method is Independent Component Analysis (ICA) [17]. For a multiplicative perturbation P = AX, the basic idea is to find an optimal projection, wP, where w is a (d + 2)-dimensional row vector, that results in a row vector with its value distribution close to that of one original attribute. It can be extended to find a matrix W, so that WP gives independent and non-gaussian rows, i.e., a good estimate of X.

The ICA algorithms [11], [13] are optimization algorithms that try to find such projections by maximizing the non-gaussianity⁴ of the projection wP. The non-gaussianity of the original attributes is crucial because any projection of a multidimensional normal distribution is still a normal distribution, which leaves no clue for recovery.

Therefore, with our design of OPE and the noise dimension in Section 3, we have the following result.

Proposition 2: There are $O(2^{dn})$ candidate projection vectors, w, that lead to the same level of non-gaussianity.

Proof: The OPE encrypted matrix X (with the homogeneous dimension excluded, which can possibly be recovered) can be treated as a sample set drawn from a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$. Any invertible transformation P = AX will result in another multivariate normal distribution $\mathcal{N}(A\mu, A\Sigma A^T)$. Thus, any projection wP will not change the gaussianity, and there are $O(2^{dn})$ such candidates of w. □

Thus, the probability of identifying the right projection is negligible in terms of the number of bits n. This shows that any ICA-style estimation that depends on non-gaussianity is equally ineffective against the RASP perturbation.
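
The step used in the proof can be written out explicitly. The following restates it under the proof's own assumption that the OPE-encrypted records are jointly normal; $\mu$ and $\Sigma$ denote their mean vector and covariance matrix, and w is any row vector:

```latex
X \sim \mathcal{N}(\mu, \Sigma),\quad P = AX
\;\Longrightarrow\;
P \sim \mathcal{N}\!\left(A\mu,\; A\Sigma A^{T}\right),\qquad
wP \sim \mathcal{N}\!\left(wA\mu,\; wA\Sigma A^{T}w^{T}\right)
```

Every projection is therefore exactly Gaussian, so a search that ranks projections by non-gaussianity has nothing to distinguish the correct w from any other.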

In addition to ICA, Principal Component Analysis 
(PCA) based attack is another possible distributional 
attack, which, however, depends on the preservation 
of covariance matrix [20]. Because the covariance 
matrix is not preserved in RASP perturbation, the 
PCA attack cannot be used on RASP perturbed data. 
It is unknown whether there are other distributional 
methods for approximately separating X or A from 
the perturbed data P, which will be studied in the 
ongoing work. 

4. Non-gaussianity means that the distribution is not a normal distribution.



In the worst-case estimation, the attacker can simply draw a sample $\hat{X}_j$ from the known distribution of the original $X_j$; thus, $\hat{X}_j$ and $X_j$ are independent but have the same distribution. It follows that $MSE = var(X_j - \hat{X}_j) = var(X_j) + var(\hat{X}_j) - 2\,cov(X_j, \hat{X}_j) = 2\sigma^2$, since the covariance of independent variables is zero. Correspondingly, $NR\_MSE = (2\sqrt{MSE})/(4\sigma) = \sqrt{2}/2 \approx 71\%$.
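
The 71% figure can be reproduced with a small Monte Carlo simulation. The sketch assumes the attribute is normally distributed and uses 4σ as the domain length, as in the text; the function name `simulated_nr_mse`, the sample size, and the seed are illustrative assumptions.

```python
import math
import random


def simulated_nr_mse(n: int, sigma: float, seed: int = 0) -> float:
    """Estimate NR_MSE when the guess is an independent draw from the same normal distribution."""
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)       # original value
        x_hat = rng.gauss(0.0, sigma)   # attacker's independent guess
        sq_err += (x - x_hat) ** 2
    rmse = math.sqrt(sq_err / n)
    return 2.0 * rmse / (4.0 * sigma)   # domain length approximated by 4 * sigma


estimate = simulated_nr_mse(n=200_000, sigma=3.0)
```

The estimate should land close to √2/2 regardless of the chosen σ, because σ cancels in the ratio.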

4 RASP Range-Query Processing 

Based on the RASP perturbation method, we design the services for two types of queries: range query and kNN query. This section is dedicated to range query processing. We first show that a range query in the original space can be transformed to a polyhedron query in the perturbed space, and then develop a secure way to do the query transformation. Finally, we develop a two-stage query processing strategy for efficient range query processing.

4.1 Transforming Range Queries 

Let's look at the general form of a range query condition. Let $X_i$ be an attribute in the database. A simple condition in a range query involves only one attribute and is of the form "$X_i$ <op> $a_i$", where $a_i$ is a constant in the normalized domain of $X_i$ and $op \in \{<, >, =, \le, \ge, \ne\}$ is a comparison operator. For convenience, we will only discuss how to process $X_i \le a_i$, while the proposed method can be slightly changed for other conditions. Any complicated range query can be transformed into a disjunction of a set of conjunctions, i.e., $\bigcup_{j=1}^{n} (\bigcap_{i=1}^{m} C_{i,j})$, where m, n are some integers depending on the original query conditions and $C_{i,j}$ is a simple condition about $X_i$. Again, to simplify the presentation, we restrict our discussion to a single conjunction condition $\bigcap_{i=1}^{m} C_i$, where $C_i$ is of the form $b_i \le X_i \le a_i$. Such a conjunction condition describes a hyper-cubic area in the multidimensional space.
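
The disjunction-of-conjunctions form can be sketched as a nested list of simple conditions. This is an illustration on plaintext records only; the tuple layout, the `OPS` table, and the function names are assumptions made for the example.

```python
import operator
from typing import Callable, Dict, List, Sequence, Tuple

# A simple condition "X_i <op> a_i" stored as (attribute index, operator symbol, constant).
Simple = Tuple[int, str, float]

OPS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
    "<=": operator.le, ">=": operator.ge, "!=": operator.ne,
}


def holds(record: Sequence[float], cond: Simple) -> bool:
    """Evaluate one simple condition on a record."""
    i, op, a = cond
    return OPS[op](record[i], a)


def matches(record: Sequence[float], dnf: List[List[Simple]]) -> bool:
    """True if any conjunction in the disjunction is fully satisfied."""
    return any(all(holds(record, c) for c in conj) for conj in dnf)


# (1 <= X_0 <= 3 and X_1 <= 5) or (X_0 = 9)
query = [[(0, ">=", 1.0), (0, "<=", 3.0), (1, "<=", 5.0)], [(0, "=", 9.0)]]
results = [matches(r, query) for r in [(2.0, 4.0), (2.0, 6.0), (9.0, 0.0)]]
```

Each inner list is one conjunction, which is why the discussion can be restricted to a single conjunction without loss of generality.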

According to the three nested transformations in RASP, $F(G(E_{ope}(x)))$, we will first show that an OPE will transform the original hyper-cubic area to another hyper-cubic area in the OPE space.

Proposition 1: Order preserving encryption func- 
tions transform a hyper-cubic query range to another 
hyper-cubic query range. 

Proof: The original range query condition consists of simple conditions like $b_i \le X_i \le a_i$ for each dimension. Since the order is preserved, each simple condition is transformed as follows: $E_{ope}(b_i) \le E_{ope}(X_i) \le E_{ope}(a_i)$, which means the transformed range is still a hyper-cubic query range. □
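
The proof can be checked on a toy example with any strictly increasing function. In the sketch below, `toy_ope` is a hypothetical monotone stand-in and not the OPE scheme of [1]; the values and bounds are also illustrative.

```python
import math


def toy_ope(x: float) -> float:
    """Strictly increasing stand-in for E_ope (defined for x > -1)."""
    return math.log1p(x) * 3.0 - 1.0


def in_range(value: float, low: float, high: float) -> bool:
    return low <= value <= high


values = [0.5, 2.0, 7.5, 10.0, 42.0]
b, a = 2.0, 10.0   # original condition b <= X <= a

before = [in_range(x, b, a) for x in values]
after = [in_range(toy_ope(x), toy_ope(b), toy_ope(a)) for x in values]
```

Membership in the original interval and in the transformed interval agree for every value, including the two boundary values.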

Fig. 2. R-tree index.

Fig. 3. ... two-stage processing algorithm.

Let $y = E_{ope}(x)$ and $c_i = E_{ope}(a_i)$. A simple condition $Y_i < c_i$ defines a half-space. With the extended dimensions $z^T = (y^T, 1, v)$, the half-space can be represented as $w^T z < 0$, where $w$ is a $(d+2)$-dimensional vector with $w_i = 1$, $w_{d+1} = -c_i$, and $w_j = 0$ for $j \ne i, d+1$. Finally, let $u = Az$, according to the RASP transformations. With this representation, the original condition is equivalent to

$w^T A^{-1} u < 0$ (3)

in the RASP-perturbed space, which is still a half-space condition. However, this half-space condition will not be parallel to the coordinate axes. These transformed conditions together form a polyhedron (as illustrated in Figure 3). The query service will need to find the records in the polyhedron area, which is supported by the two-stage processing algorithm.
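
The equivalence between the original condition and condition (3) can be checked numerically. The following is a minimal sketch, assuming a hypothetical small key matrix `A` and made-up values for $E_{ope}(x)$, the noise $v$, and the bound $c_i$; the helpers `invert`, `matvec`, and `dot` are illustrative and not part of any RASP implementation.

```python
import random

def invert(m):
    """Gauss-Jordan inverse of a small square matrix given as a list of rows."""
    n = len(m)
    aug = [list(row) + [float(r == c) for c in range(n)] for r, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
d = 3
n = d + 2
# Hypothetical key matrix; the large diagonal makes it strictly
# diagonally dominant and therefore invertible.
A = [[rng.uniform(-1, 1) + (n + 1 if r == c else 0) for c in range(n)]
     for r in range(n)]
A_inv = invert(A)

y = [0.2, 0.7, 0.5]      # stands in for E_ope(x)
v = 1.3                  # noise component
z = y + [1.0, v]         # extended vector (y, 1, v)
u = matvec(A, z)         # perturbed record u = A z

i, c_i = 1, 0.6          # condition Y_i < c_i (0-based index i)
w = [0.0] * n
w[i] = 1.0
w[d] = -c_i              # entry d+1 in the 1-based notation of the text
# Row vector w^T A^{-1}, the quantity that would be applied to u.
w_A_inv = [dot(w, [A_inv[r][c] for r in range(n)]) for c in range(n)]

lhs = dot(w_A_inv, u)    # equals y_i - c_i up to rounding
print(lhs, y[i] - c_i)
```

The sign of `lhs` decides the condition in the perturbed space exactly as `y[i] - c_i` does in the OPE space.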

4.2 Security Enhancement on Query Transforma- 
tion 

The attacker may also target the transformed queries. In this section we discuss such attacks and describe the methods for countering them. Note that the attack on small ranges will be described in kNN query processing.

Countering Dimensional Selection Attack. We show that the dimensional selection attack can reveal partial information of the selected data dimensions, if the attacker knows the distribution of the dimension. Assume the query condition is applied to the $i$-th dimension. If the query parameter $w^T A^{-1}$ is directly submitted to the cloud side, the server can apply $w^T A^{-1}$ to each record $u$ in the server, and get $w^T A^{-1} u = E_{ope}(x_i) - E_{ope}(a_i)$, where $x_i$ is the $i$-th dimension of the corresponding original record $x$. After getting all such values for the dimension $i$, with the known original data distributions, the attacker can apply the bucket-based distributional attack on the OPE encrypted data (see Section 7) to get an accurate estimate.

According to the design of noise, the extended $(d+2)$-th dimension $v$ in the RASP perturbation $F(x) = A(E_{ope}(x)^T, 1, v)^T$ is always greater than $v_0$, which can be used to construct secure query conditions. Instead of processing a half-space condition $E_{ope}(X_i) < E_{ope}(a_i)$, we use $(E_{ope}(X_i) - E_{ope}(a_i))(v - v_0) < 0$. These two conditions are equivalent because $v$ always satisfies $v > v_0$. Using similar transformations, we get $E_{ope}(X_i) - E_{ope}(a_i) = w^T A^{-1} u$ and $v - v_0 = q^T A^{-1} u$, where $q_{d+2} = 1$, $q_{d+1} = -v_0$, and $q_j = 0$ for $j \le d$. Thus, we get the transformed quadratic query condition

$u^T (A^{-1})^T w q^T A^{-1} u < 0$. (4)

Let $\Theta_i = (A^{-1})^T w q^T A^{-1}$. Now $\Theta_i$ is submitted to the server and the server will use $u^T \Theta_i u < 0$ to filter out the results.

We now show that this query transformation is resilient to the dimensional selection attack. Applying $u^T \Theta_i u$ to each record $u$, we get $(E_{ope}(X_i) - E_{ope}(a_i))(v - v_0)$. Since $v$ is randomly chosen for each record, the value $E_{ope}(X_i) - E_{ope}(a_i)$ is protected by the randomization. $\Theta_i$ does not reveal the key parameters either. Let $c_i = E_{ope}(a_i)$ and $\alpha_i$ be the $i$-th row of $A^{-1}$. $\Theta_i$ is $(\alpha_i - c_i \alpha_{d+1})^T (\alpha_{d+2} - v_0 \alpha_{d+1})$. As all the components $\alpha_i$, $c_i$, $\alpha_{d+1}$, and $\alpha_{d+2}$ are unknown and cannot be further reduced, $\Theta_i$ provides no information to help derive information about $A^{-1}$.
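
As a worked step, the value the server obtains from the quadratic form can be expanded as follows. This is only a restatement of the definitions in this section, writing $z = A^{-1}u = (E_{ope}(x)^T, 1, v)^T$ and taking $q^T z = v - v_0$ as stated in the text.

```latex
\begin{align*}
u^T \Theta_i u
  &= u^T (A^{-1})^T w \, q^T A^{-1} u \\
  &= (A^{-1} u)^T w \cdot q^T (A^{-1} u) \\
  &= (w^T z)\,(q^T z) \\
  &= \bigl(E_{ope}(x_i) - E_{ope}(a_i)\bigr)\,(v - v_0).
\end{align*}
```

Because $v - v_0 > 0$ for every record, the sign of the product is the sign of $E_{ope}(x_i) - E_{ope}(a_i)$, while its magnitude is scaled by a per-record random factor.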

Other Potential Threats. The query transformation method does not introduce randomness: the same query will always get the same transformation, and thus the confidentiality of the access pattern is not preserved. We summarize the leaked information related to access patterns as follows.

• Attackers know the exact frequency of each transformed query.

• The set relationships (set intersection, union, difference, etc.) between the query results are revealed as a result of exact range query processing.

• Some query matrices on the same dimension may have a special relationship preserved, as shown in Proposition 3, which we will discuss later.

We admit this is a weakness of the current design. 
However, according to the threat model, the adversary 
will not know any of the original data and queries. 
Thus, by simply observing the query frequency or re- 
lationships between queries, one cannot derive useful 
information. An important future work is to formally 
define the specific information leakage caused by the 
leaked query and access patterns, and then precisely 
analyze the data and query confidentiality affected 
by this information leakage under different security 
assumptions. 

4.3 A Two-Stage Query Processing Strategy with 
Multidimensional Index Tree 

With the transformed queries, the next important task 
is to process queries efficiently and return precise 
results to minimize the client-side post-processing 
effects. A commonly used method is to use multi- 
dimensional tree indices to improve the search per- 
formance. However, multidimensional tree indices 
are normally used to process axis-aligned "bounding 
boxes"; whereas, the transformed queries are in ar- 
bitrary polyhedra, not necessarily aligned to axes. In 
this section, we propose a two-stage query processing 
strategy to handle such irregular-shape queries in the 
perturbed space. 

Multidimensional Index Tree. Most multidimensional indexing algorithms are derived from R-tree like algorithms [22], where the axis-aligned minimum bounding region (MBR) is the construction block for indexing the multidimensional data. For 2D data, an MBR is a rectangle. For higher dimensions, the shape of the MBR is extended to a hyper-cube. Figure 2 shows the MBRs in the R-tree for a 2D dataset, where each node is bounded by a node MBR. The R-tree range query algorithm compares the MBR and the queried range to find the answers.

The Two-Stage Processing Algorithm. The trans- 
formed query describes a polyhedron in the perturbed 
space that cannot be directly processed by multi- 
dimensional tree algorithms. New tree search algo- 
rithms could be designed to use arbitrary polyhedron 
conditions directly for search. However, we use a 
simpler two-stage solution that keeps the existing tree 
search algorithms unchanged. 

At the first stage, the proxy in the client side finds the MBR of the polyhedron (as a part of the submitted transformed query) and submits the MBR and a set of secured query conditions $\{\Theta_1, \ldots, \Theta_m\}$ to the server. The server then uses the tree index to find the set of records enclosed by the MBR.

The MBR of the polyhedron can be efficiently found based on the original range. The original query condition constructs a hyper-cube shape. With the described query transformation, the vertices of the hyper-cube are also transformed to vertices of the polyhedron. Therefore, the MBR of the vertices is also the MBR of the polyhedron [27]. Figure 3 illustrates the relationship between the vertices and the MBR and the two-stage processing strategy.
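
The vertex-based MBR computation can be sketched as follows. This is an illustration only: it assumes a hypothetical plain linear map `A` standing in for the perturbation (no OPE and no noise dimension), and the function names are made up for the example.

```python
from itertools import product

def transform(A, x):
    """Apply the (hypothetical) linear perturbation u = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mbr_of_hypercube(A, lows, highs):
    """MBR of the image of the hyper-cube [lows, highs] under A.

    Only the 2^d corner points are transformed; the per-dimension
    minima and maxima of their images give the bounding box.
    """
    corners = product(*zip(lows, highs))
    images = [transform(A, c) for c in corners]
    dims = range(len(A))
    return ([min(p[j] for p in images) for j in dims],
            [max(p[j] for p in images) for j in dims])

A = [[2, 1], [-1, 3]]
box_lo, box_hi = mbr_of_hypercube(A, lows=[0, 0], highs=[1, 2])
inner = transform(A, [0.5, 1.0])   # image of an interior point of the range
print(box_lo, box_hi, inner)
```

Any point inside the original range is a convex combination of the corners, so its image falls inside the box computed from the corner images.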

At the second stage, the server uses the transformed half-space conditions to filter the initial result. In most cases of tight ranges, the initial result set will be reasonably small so that it can be filtered in memory by simply checking the transformed half-space conditions. However, in the worst case, the MBR of the polyhedron may enclose the entire dataset and the second stage is reduced to a linear scan of the entire dataset. The second stage returns the exact range query result to the proxy server, which significantly reduces the post-processing cost that the proxy server needs to take. This is very important to the cloud-based service, because low post-processing cost requires low in-house investment.
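
The two stages can be sketched end to end as below. This is a simplified illustration under stated assumptions: a hypothetical 2D linear map `A` replaces the full RASP perturbation, a linear scan over the MBR replaces the R-tree lookup, and plain half-space conditions are used in place of the quadratic conditions $\Theta_i$.

```python
from itertools import product
import random

A = [[2.0, 1.0], [-1.0, 3.0]]                 # hypothetical perturbation u = A x
A_INV = [[3 / 7, -1 / 7], [1 / 7, 2 / 7]]     # its inverse, known to the client only

def apply(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def make_query(lows, highs):
    """Client side: MBR of the transformed range plus (vector, bound) half-spaces."""
    corners = [apply(A, c) for c in product(*zip(lows, highs))]
    box = [(min(c[j] for c in corners), max(c[j] for c in corners))
           for j in range(2)]
    conds = []
    for j in range(2):
        conds.append((A_INV[j], highs[j]))                 # x_j <= high_j
        conds.append(([-a for a in A_INV[j]], -lows[j]))   # x_j >= low_j
    return box, conds

def two_stage(perturbed, box, conds):
    """Server side: stage 1 filters by the MBR, stage 2 by the half-spaces."""
    stage1 = [k for k, u in enumerate(perturbed)
              if all(lo <= u[j] <= hi for j, (lo, hi) in enumerate(box))]
    stage2 = [k for k in stage1
              if all(sum(a * b for a, b in zip(vec, perturbed[k])) <= bound
                     for vec, bound in conds)]
    return stage1, stage2

rng = random.Random(1)
data = [[rng.random(), rng.random()] for _ in range(200)]
perturbed = [apply(A, x) for x in data]
lows, highs = [0.2, 0.3], [0.6, 0.8]
box, conds = make_query(lows, highs)
stage1, stage2 = two_stage(perturbed, box, conds)
# Ground truth computed on the original data, for comparison only.
expected = [k for k, x in enumerate(data)
            if all(l <= a <= h for l, a, h in zip(lows, x, highs))]
print(len(stage1), len(stage2), len(expected))
```

The gap between `len(stage1)` and `len(stage2)` is the extra work caused by the MBR being larger than the polyhedron.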

5 KNN Query Processing with RASP 

Because the RASP perturbation does not preserve 
distances (and distance orders), kNN query cannot be 
directly processed with the RASP perturbed data. In 
this section, we design a kNN query processing algo- 
rithm based on range queries (the kNN-R algorithm). 
As a result, the use of index in range query processing 
also enables fast processing of kNN queries. 

5.1 Overview of the kNN-R Algorithm 

The original distance-based kNN query processing finds the nearest k points in the spherical range that is centered at the query point. The basic idea of our algorithm is to use square ranges, instead of spherical ranges, to find the approximate kNN results, so that the RASP range query service can be used. There are a number of key problems to make this work securely and efficiently. (1) How to efficiently find the minimum square range that surely contains the k results, without many interactions between the cloud and the client? (2) Will this solution preserve data confidentiality and query privacy? (3) Will the proxy server's workload increase, and to what extent?

The algorithm is based on square ranges to approx- 
imately find the kNN candidates for a query point, 
which are defined as follows. 

Definition 1: A square range is a hyper-cube that 
is centered at the query point and with equal-length 
edges. 

Figure 5 illustrates the range-query-based kNN processing with two-dimensional data. The Inner Range is the square range that contains at least k points, and the Outer Range encloses the spherical range that encloses the inner range. The outer range surely contains the kNN results (Proposition 2), but it may also contain irrelevant points that need to be filtered out.

Proposition 2: The kNN-R algorithm returns results 
with 100% recall. 

Proof: The sphere in Figure 5 between the outer range and the inner range covers all points with distances less than the radius r. Because the inner range contains at least k points, there are at least k nearest neighbors to the query point with distances less than the radius r. Therefore, the k nearest neighbors must be in the outer range. □

The kNN-R algorithm consists of two rounds of 
interactions between the client and the server. Figure 2] 
demonstrates the procedure. (1) The client will send 
the initial upper-bound range, which contains more 
than k points, and the initial lower-bound range, 
which contains less than k points, to the server. The 
server finds the inner range and returns to the client. 
(2) The client calculates the outer range based on the 
inner range and sends it back to the server. The server 
finds the records in the outer range and sends them 
to the client. (3) The client decrypts the records and
finds the top k candidates as the final result.
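
The geometry behind the inner and outer ranges can be sketched on unperturbed data. This is an illustration only: it works in the original space, computes the inner range directly instead of through server-side range queries, and the names `chebyshev` and `knn_r` are made up for the example.

```python
import math
import random

def chebyshev(p, q):
    """Largest per-dimension offset; p lies in the square of half-edge h iff this is <= h."""
    return max(abs(a - b) for a, b in zip(p, q))

def knn_r(points, q, k):
    d = len(q)
    # Inner range: smallest square (half-edge `inner`) centered at q
    # that holds at least k points.
    inner = sorted(chebyshev(p, q) for p in points)[k - 1]
    # Sphere through the inner square's corners, then the square enclosing it.
    radius = inner * math.sqrt(d)
    outer = [p for p in points if chebyshev(p, q) <= radius]
    # Client-side post-processing: exact distances on the candidates only.
    outer.sort(key=lambda p: math.dist(p, q))
    return outer[:k], len(outer)

rng = random.Random(7)
points = [(rng.random(), rng.random()) for _ in range(500)]
q = (0.5, 0.5)
result, candidates = knn_r(points, q, k=5)
brute = sorted(points, key=lambda p: math.dist(p, q))[:5]
print(candidates, result == brute)
```

The ratio `k / candidates` is the precision discussed next; the candidates beyond the true k neighbors are what the proxy has to filter out.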

If the points are approximately uniformly distributed, we can estimate the precision of the returned result. With the uniform assumption, the number of points in an area is proportional to the size of the area. If the inner range contains $m$ points, $m \ge k$, the outer range contains $q$ points, and the dimensionality is $d$, we can derive $q = 2^{d/2} m$. Thus, the precision is $k/q = k/(2^{d/2} m)$. If $m \approx k$ and $d = 2$, the precision is around 0.5. When $d$ increases, the precision decreases exponentially due to the curse of dimensionality [23], which suggests that kNN-R is not expected to work effectively on high-dimensional data. We will show this weakness in experiments.

5.2 Finding Compact Inner Square Range 

An important step in the kNN-R algorithm is to find the compact inner square range to achieve high precision. In the following, we give the $(k, \delta)$-range for efficiently finding the compact inner range.

Definition 2: A $(k, \delta)$-range is any square range centered at the query point, the number of points in which is in the range $[k, k + \delta]$, where $\delta$ is a nonnegative integer.

We design an algorithm similar to binary search to efficiently find the $(k, \delta)$-range. Suppose a square range centered at the query point with length $L$ in each dimension is represented as $S^{(L)}$. Let the number of points included by this range be $N^{(L)}$. If a square range $S^{(in)}$ is enclosed by another square range $S^{(out)}$, we say $S^{(in)} \subset S^{(out)}$. It directly follows that $N^{(in)} \le N^{(out)}$, and also

Corollary 1: If $N^{(L_1)} < N^{(L_2)}$, then $S^{(L_1)} \subset S^{(L_2)}$.

Using this definition and notation, we can always construct a series of enclosed square ranges centered on the query point: $S^{(L_1)} \subset S^{(L_2)} \subset \ldots \subset S^{(L_m)}$.

Correspondingly, the numbers of points enclosed by $\{S^{(L_i)}\}$ have the ordering $N^{(L_1)} \le N^{(L_2)} \le \ldots \le N^{(L_m)}$.

Assume that $S^{(L_1)}$ is the initial range containing less than k points and $S^{(L_m)}$ is the initial upper bound range; both are sent by the client. The problem of finding the compact inner range can be mapped to a binary search over the sequence $\{S^{(L_i)}\}$.

In each step of the binary search, we start with a lower bound range, denoted as $S^{(low)}$, and a higher bound range, $S^{(high)}$. We want the corresponding numbers of enclosed points to satisfy $N^{(low)} < k \le N^{(high)}$ in each step, which is achieved with the following procedure. First, we find the middle square range $S^{(mid)}$, where $mid = (low + high)/2$. If $S^{(mid)}$ covers no less than k points, the higher bound $S^{(high)}$ is updated to $S^{(mid)}$; otherwise, the lower bound $S^{(low)}$ is updated to $S^{(mid)}$. At the beginning step $S^{(low)}$ is set to $S^{(L_1)}$ and $S^{(high)}$ is $S^{(L_m)}$. This process repeats until $S^{(mid)}$ is a $(k, \delta)$-range, i.e., $k \le N^{(mid)} \le k + \delta$, or $high - low < E$, where $E$ is some small positive number. Algorithm 2 in Appendix describes these steps.
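
The search loop can be sketched as follows on unperturbed data. This is an illustration under stated assumptions, not the Appendix algorithm: ranges are represented by their half-edge length, the point count is computed directly rather than through a secured range query, and the function names are hypothetical.

```python
import random

def count_in_square(points, q, half):
    """Number of points in the square range of half-edge `half` centered at q."""
    return sum(1 for p in points
               if all(abs(a - b) <= half for a, b in zip(p, q)))

def find_k_delta_range(points, q, k, delta, low, high, eps=1e-6):
    """Binary search over half-edge lengths.

    Assumes count(low) < k <= count(high) on entry and keeps that invariant.
    """
    n_high = count_in_square(points, q, high)
    rounds = 0
    while n_high > k + delta and high - low >= eps:
        mid = (low + high) / 2
        n_mid = count_in_square(points, q, mid)
        rounds += 1                       # one server-side range query per round
        if n_mid >= k:
            high, n_high = mid, n_mid     # shrink the higher bound
        else:
            low = mid                     # grow the lower bound
    return high, n_high, rounds

rng = random.Random(11)
points = [(rng.random(), rng.random()) for _ in range(1000)]
q = (0.5, 0.5)
half, n_found, rounds = find_k_delta_range(points, q, k=10, delta=5,
                                           low=0.0, high=1.0)
print(half, n_found, rounds)
```

A larger `delta` lets the loop stop earlier at the price of a looser inner range, which is the tradeoff examined in the experiments.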

Selection of Initial Inner/Outer Bounds. The initial inner bound can be the query point itself. If the query point is $q(q_1, \ldots, q_d)$, $S^{(L_1)}$ is a hyper-cube defined by $\{q_i \le X_i \le q_i, i = 1 \ldots d\}$. The naive selection of $S^{(L_m)}$ would be the whole domain. However, we can effectively reduce the range with a coarse density map organized in a tiny flat multidimensional tree, which can be included in the preprocessing step in the client side. The details are omitted due to the space limitation.

5.3 Finding Inner Range with RASP Perturbed 
Data 

Algorithm 4 gives the basic ideas of finding the compact inner range in iterations. There are two critical operations in this algorithm: (1) finding the number of points in a square range and (2) updating the higher and lower bounds. Because range queries are secured in the RASP framework, the key is to update the bounds with the secured range queries, without the help of the client-side proxy server.

As discussed in the RASP query processing, a range query such as $S^{(L)}$ is encoded as the $MBR^{(L)}$ of its polyhedron range in the perturbed space and the $2(d+2)$ dimensional conditions $y^T \Theta_i^{(L)} y < 0$ determining the sides of the polyhedron; each of the $d + 2$ extended dimensions gets a pair of conditions for the upper and lower bounds, respectively.

The problem of binary range search is to use the higher bound range $S^{(high)}$ and the lower bound range $S^{(low)}$ to derive $S^{(mid)}$. When all of these ranges are secured, the problem is transformed to (1) deriving $\Theta_i^{(mid)}$ from $\Theta_i^{(high)}$ and $\Theta_i^{(low)}$; and (2) deriving $MBR^{(mid)}$ from $MBR^{(high)}$ and $MBR^{(low)}$. The following discussion will be focused on the simplified RASP version without the OPE component, which will then be extended with the OPE component.

We show that

Proposition 3: $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = \Theta_i^{(mid)}$.

Proof: Remember that $\Theta_i$ for $X_i < c_i$ can be represented as $(\alpha_i - c_i \alpha_{d+1})^T (\alpha_{d+2} - v_0 \alpha_{d+1})$, where $\alpha_i$ is the $i$-th row of the matrix $A^{-1}$. Let the conditions be $X_i < h$, $X_i < l$, and $X_i < (h + l)/2$ for the high, low, and middle bounds, correspondingly. Thus, $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = (\alpha_i - ((h + l)/2)\alpha_{d+1})^T (\alpha_{d+2} - v_0 \alpha_{d+1})$, which is $\Theta_i^{(mid)}$. □
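
The averaging property relies only on $\Theta_i$ being linear in the bound. A minimal numeric check, assuming arbitrary random vectors in place of the rows of $A^{-1}$ (the vector `beta` stands for the second factor of the product, which does not depend on the bound; all names are hypothetical):

```python
import random

def theta(alpha_i, alpha_d1, beta, c):
    """Rank-one matrix (alpha_i - c * alpha_d1)^T beta, stored as a list of rows."""
    left = [a - c * b for a, b in zip(alpha_i, alpha_d1)]
    return [[x * y for y in beta] for x in left]

rng = random.Random(3)
n = 5
alpha_i = [rng.uniform(-1, 1) for _ in range(n)]
alpha_d1 = [rng.uniform(-1, 1) for _ in range(n)]
beta = [rng.uniform(-1, 1) for _ in range(n)]

h, l = 0.8, 0.2
t_high = theta(alpha_i, alpha_d1, beta, h)
t_low = theta(alpha_i, alpha_d1, beta, l)
t_mid = theta(alpha_i, alpha_d1, beta, (h + l) / 2)

# Element-wise average of the two bound matrices, as the server would compute it.
avg = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(t_high, t_low)]
max_err = max(abs(a - b) for r1, r2 in zip(avg, t_mid) for a, b in zip(r1, r2))
print(max_err)
```

The server can therefore form the middle condition from the two matrices it already holds, without learning `h`, `l`, or the rows of the key.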

As we have mentioned, the MBR of an arbitrary polyhedron can be derived based on the vertices of the polyhedron. A polyhedron is mapped to another polyhedron after the RASP perturbation. Concretely, let a polyhedron $P$ have $m$ vertices $\{x_1, \ldots, x_m\}$, which are mapped to the vertices in the perturbed space: $\{y_1, \ldots, y_m\}$. Then, the upper bound and lower bound of dimension $j$ of the MBR of the polyhedron in the perturbed space are determined by $\max\{y_{ij}, i = 1 \ldots m\}$ and $\min\{y_{ij}, i = 1 \ldots m\}$, respectively.

Let the $j$-th dimension of $MBR^{(L)}$ be represented as $[s_{j,min}^{(L)}, s_{j,max}^{(L)}]$, where $s_{j,min}^{(L)} = \min\{y_{ij}^{(L)}, i = 1 \ldots m\}$ and $s_{j,max}^{(L)} = \max\{y_{ij}^{(L)}, i = 1 \ldots m\}$. Now we choose $MBR^{(MID)}$ as follows: for the $j$-th dimension we use $[(s_{j,min}^{(low)} + s_{j,min}^{(high)})/2, (s_{j,max}^{(low)} + s_{j,max}^{(high)})/2]$. We show that

Proposition 4: $MBR^{(MID)}$ encloses $MBR^{(mid)}$.

The details of the proof can be found in the Appendix. Because the MBR is only used for the first stage of range query processing, a slightly larger MBR still encloses the polyhedron, which guarantees the correctness of the two-stage range query processing.
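
The enclosure claim can be illustrated numerically. This sketch assumes a hypothetical plain linear map `A` in place of the perturbation (no OPE, no noise dimension) and square ranges given by a center and a half-edge; in this simplified setting the averaged box coincides with the true middle box up to rounding, so a small tolerance is used in the comparison.

```python
from itertools import product

A = [[2.0, 1.0], [-1.0, 3.0]]   # hypothetical linear perturbation

def mbr(center, half):
    """Per-dimension (min, max) of the transformed corners of a square range."""
    corners = product(*[(c - half, c + half) for c in center])
    images = [[sum(a * x for a, x in zip(row, c)) for row in A] for c in corners]
    return [(min(p[j] for p in images), max(p[j] for p in images))
            for j in range(len(A))]

center, low, high = (0.5, 0.5), 0.1, 0.3
mbr_low, mbr_high = mbr(center, low), mbr(center, high)
mbr_mid = mbr(center, (low + high) / 2)        # MBR^(mid): from the true middle range
mbr_MID = [((a0 + b0) / 2, (a1 + b1) / 2)      # MBR^(MID): averaged bounds
           for (a0, a1), (b0, b1) in zip(mbr_low, mbr_high)]

encloses = all(M0 <= m0 + 1e-12 and m1 <= M1 + 1e-12
               for (M0, M1), (m0, m1) in zip(mbr_MID, mbr_mid))
print(mbr_MID, mbr_mid, encloses)
```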

Including the OPE component. The results on $\Theta_i^{(mid)}$ and $MBR^{(MID)}$ can be extended to the RASP scheme with the OPE component. However, due to the introduction of the order preserving function $f_i$, the middle point may not be strictly the middle point, but somewhere between the higher bound and lower bound. We use "between" (btw) to denote it.

Specifically, let $X_i < h$ and $X_i < l$ be the corresponding conditions for the higher and lower bounds. Let the condition for the "between" bound be $X_i < b$, where $b$ satisfies $f_i(b) = (f_i(h) + f_i(l))/2$. According to the OPE property, we have $l < b < h$, i.e., the corresponding range is still between the lower range and higher range. Therefore, the same binary search algorithm can still be applied, according to Corollary 1. The server can also derive $(\Theta_i^{(high)} + \Theta_i^{(low)})/2 = (\alpha_i - ((f_i(h) + f_i(l))/2)\alpha_{d+1})^T (\alpha_{d+2} - v_0 \alpha_{d+1}) = \Theta_i^{(btw)}$, a result similar to Proposition 3.
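
The position of the "between" bound can be illustrated with a toy order-preserving function. This is a sketch under an assumption: `f` below is a hypothetical strictly increasing function chosen for the example, not the OPE used by the authors, and the bound `b` is recovered by bisection only to show where it falls.

```python
def f(x):
    """Hypothetical order-preserving (strictly increasing) function."""
    return x ** 3 + x

def between_bound(l, h, tol=1e-12):
    """Find b in [l, h] with f(b) = (f(h) + f(l)) / 2 by bisection."""
    target = (f(h) + f(l)) / 2
    lo, hi = l, h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

l, h = 0.2, 0.9
b = between_bound(l, h)
print(b, (l + h) / 2)
```

The bound `b` lies strictly between `l` and `h` but differs from the arithmetic middle, which is why the text speaks of a "between" range rather than a middle range.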

Similarly, we define $MBR^{(BTW)}$ with $s_{j,min}^{(BTW)} = (s_{j,min}^{(low)} + s_{j,min}^{(high)})/2$ and $s_{j,max}^{(BTW)} = (s_{j,max}^{(low)} + s_{j,max}^{(high)})/2$, while $MBR^{(btw)}$ is defined based on the vertices to be consistent with $\Theta_i^{(btw)}$. Because the relationships Eq. 6 and 7 in the Appendix are still true with the OPE transformation, we can prove that $MBR^{(BTW)}$ also encloses $MBR^{(btw)}$. Due to the space limitation, we skip the details.

5.4 Defining Initial Bounds 

The complexity of the $(k, \delta)$-range algorithm is determined by the initial bounds provided by the client. Thus, it is important to provide compact ones to help the server process queries more efficiently. The initial lower bound is defined as the query point. For $q(q_1, \ldots, q_d)$, the dimensional bounds are simply $q_j \le X_j \le q_j$.

The higher bounds can be defined in multiple ways. (1) Applications often have a user-specified interest bound, for example, returning the nearest gas station within 5 miles, which can be used to define the higher bound. (2) We can also use a center-distance based bound setting. Let the query point have a distance $\gamma$ to the distribution center; as we always work on normalized distributions, the center is $(0, \ldots, 0)$. The upper bound is defined as $q_j - \epsilon\gamma \le X_j \le q_j + \epsilon\gamma$, where $\epsilon \in (0, 1]$ defines the level of conservativity. (3) If it is really expected to include all candidate kNN regardless of how distant they are, we can include a rough density map (a multidimensional histogram) for quickly identifying the appropriate higher bound. However, this method works best for low-dimensional data as the number of bins increases exponentially with the number of dimensions. In experiments, we simply use the method (1) and 5% of the domain length for the extension.
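
The center-distance based setting (2) amounts to a few lines of arithmetic. A minimal sketch, assuming the distribution center is the origin as stated above; the function name and the sample query point are hypothetical.

```python
import math

def initial_bounds(q, eps):
    """Per-dimension (low, high) pairs for the initial lower and higher bound ranges."""
    gamma = math.sqrt(sum(x * x for x in q))    # distance from q to the center (0, ..., 0)
    lower = [(x, x) for x in q]                 # degenerate range: the query point itself
    upper = [(x - eps * gamma, x + eps * gamma) for x in q]
    return lower, upper

lower, upper = initial_bounds((3.0, 4.0), eps=0.5)
print(lower, upper)
```

A smaller `eps` gives a tighter starting range and fewer search rounds, but the range must still contain at least k points for the search to be valid.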

5.5 Security of kNN Queries 

As all kNN queries are completely transformed to range queries, the security of kNN queries is equivalent to the security of range queries. According to the previous discussion in Section 4.2, the transformed range queries are secure under the assumptions. Therefore, the kNN queries are also secure. Detailed proofs are skipped due to the space limitation.

6 Experiments 

In this section, we present four sets of experimental results to investigate the following questions, correspondingly. (1) How expensive is the RASP perturbation? (2) How resilient is the OPE-enhanced RASP to the ICA-based attack? (3) How efficient is the two-stage range query processing? (4) How efficient is the kNN-R query processing and what are the advantages?

6.1 Datasets 

Three datasets are used in experiments. (1) A synthetic dataset that draws samples from the uniform distribution in the range [0, 1]. (2) The Adult dataset from the UCI machine learning database (footnote 5). We assign numeric values to the categorical values using a simple one-to-one mapping scheme, as described in Section 3. (3) The 2-dimensional NorthEast location data from rtreeportal.org.

5. http://archive.ics.uci.edu/ml/

6.2 Cost of RASP Perturbation

In this experiment, we study the costs of the components in the RASP perturbation. The major costs can be divided into two parts: the OPE and the rest of RASP. We implement a simple OPE scheme [1] by mapping original column distributions to normal distributions. The OPE algorithm partitions the target distribution into buckets. Then, the sorted original values are proportionally partitioned according to the target bucket distribution to create the buckets for the original distribution. With the aligned original and target buckets, an original value can be mapped to the target bucket and appropriately scaled. Therefore, the encryption cost mainly comes from the bucket search procedure (proportional to $\log D$, where $D$ is the number of buckets). Figure 6 shows the cost distributions for 20K records at different numbers of dimensions. The dimensionality has slight effects on the cost of RASP perturbation. Overall, the cost of processing 20K records is only around 0.1 second.
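
The bucket-based mapping described above can be sketched as follows. This is one possible reading of the description, not the authors' implementation: the number of buckets, the equal-width target buckets over $[-3, 3]$ of a standard normal, and the linear scaling inside a bucket are all assumptions made for the example.

```python
from bisect import bisect_right
from statistics import NormalDist
import random

def build_ope(values, n_buckets=8, span=3.0):
    """Return an order-preserving map from the value domain onto [-span, span]."""
    target = NormalDist(0.0, 1.0)
    srt = sorted(values)
    n = len(srt)
    # Equal-width target buckets over [-span, span].
    t = [-span + 2 * span * b / n_buckets for b in range(n_buckets + 1)]
    total = target.cdf(span) - target.cdf(-span)
    # Original bucket boundaries: quantiles matching the cumulative target mass.
    o = [srt[round((target.cdf(tb) - target.cdf(-span)) / total * (n - 1))]
         for tb in t]

    def encrypt(x):
        # Bucket search, O(log D) for D buckets.
        b = min(max(bisect_right(o, x) - 1, 0), n_buckets - 1)
        width = o[b + 1] - o[b]
        frac = (x - o[b]) / width if width > 0 else 0.0
        return t[b] + frac * (t[b + 1] - t[b])   # scale inside the target bucket
    return encrypt

rng = random.Random(5)
values = [rng.random() ** 2 for _ in range(1000)]   # skewed sample column
encrypt = build_ope(values)
ordered = sorted(values)
encoded = [encrypt(x) for x in ordered]
print(encoded[0], encoded[-1])
```

Because each bucket is mapped by an increasing linear function and the buckets are ordered, the map preserves the order of the original values.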

6.3 Resilience to ICA Attack 

We have discussed the methods for countering the 
ICA distributional attack on the perturbed data. In 
this set of experiments, we evaluate how resilient the 
RASP perturbation is to the distributional attack. 

Results. We simulate the ICA attack for randomly 
chosen matrices A. The data used in the experiment 
is the 10-dimensional Adult data with 10K records. 
Figure shows the progressive results in a number 
of randomly chosen matrices A. The x-axis represents the total number of rounds for randomly choosing the matrix A; the y-axis represents the minimum dimensional NR_MSE among all dimensions. Without OPE, the label "Best-without-OPE" represents the most resilient A at the round i, "Worst-without-OPE" represents the A of the weakest resilience, and "Average-without-OPE" is the average quality of the generated A matrices for i rounds. We see that the best case is already close to the upper bound 0.7 (Section 3.3). With the OPE component, the worst case can also be significantly improved.

6.4 Performance of Two-stage Range Query Pro- 
cessing 

In this set of experiments, we study the performance aspects of polyhedron-based range query processing. We use the two-stage processing strategy described in Section 4, and explore the additional cost incurred by this processing strategy. We implement the two-stage query processing based on an R*tree implementation provided by Dr. Hadjieleftheriou at AT&T Lab (footnote 6). The block size is 4KB and we allow each block to contain only 20 entries to mimic a large database with many disk blocks. Samples from the original databases in different sizes (10,000 - 50,000 records, i.e., 500-2500 data blocks) are perturbed and indexed for query processing. Another set of indices is also built on the original data for the performance comparison with non-perturbed query processing. We will use the number of disk block accesses, including index blocks and data blocks, to assess the performance to avoid the possible variation caused by other parts of the computer system. In addition, we will also show the wall-clock time for some results.

Recall the two-stage processing strategy: using the 
MBR to search the indexing tree, and filtering the 
returned result with the secured query in quadratic 
form. We will study the performance of the first stage 
by comparing it to two additional methods: (1) the 
original queries with the index built on the original 
data, which is used to identify how much additional 

6. http://www2.research.att.com/ marioh/spatialindex/

cost is paid for querying the MBR of the transformed query; (2) the linear scan approach, which is the worst case cost. Range queries are generated randomly within the domain of the datasets, and then transformed with the method described in Section 4. We also control the range of the queries to be [10%, 20%, 30%, 40%, 50%] of the total range of the domain, to observe the effect of the scale of the range on the performance of query processing.
Results. The first pair of figures (the left subfigures of Figures 8 and 9) shows the number of block accesses for 10,000 queries on different sizes of data with different query processing methods. For clear presentation, we use $\log_{10}$(# of block accesses) as the y-axis. The cost of linear scan is simply the number of blocks for storing the whole dataset. The data dimensionality is fixed to 5 and the query range is set to 30% of the whole domain. Obviously, the first stage with MBR for polyhedron has a cost much cheaper than the linear scan method and only moderately higher than R*tree processing on the original data. Interestingly, different distributions of data result in slightly different patterns. The costs of R*tree on transformed queries are very close to those of original queries for Adult data, while the gap is larger on uniform data. The costs over different dimensions and different query ranges show similar patterns.



TABLE 1 

Wall clock cost distribution (milliseconds) and 
comparison. 



We also studied the cost of the second stage. We use "PrepQ" to represent the client-side cost of transforming queries, "purity" to represent the rate (final result count)/(1st stage result count), and records per query ("RPQ") to represent the average number of records per query for the first stage results. The quadratic filtering conditions are used in experiments. Table 1 compares the average wall-clock time (milliseconds) per query for the two stages, the RPQ values for stage 1, and the purity of the stage-1 result. The tests are run with the setting of 10K queries, 20K records, 30% dimensional query range and 5 dimensions. Since the 2nd stage is done in memory, its cost is much lower than the 1st-stage cost. Overall, the two-stage processing is much faster than linear scan and comparable to the original R*Tree processing.

6.5 Performance of kNN-R Query Processing 

In this set of experiments, we investigate several aspects of kNN query processing. (1) We will study the cost of the $(k, \delta)$-Range algorithm, which mainly contributes to the server-side cost. (2) We will show the overall cost distribution over the cloud side and the proxy server. (3) We will show the advantages of kNN-R over another popular approach: the Casper approach [24] for privacy-preserving kNN search.

$(k, \delta)$-Range Algorithm. In this set of experiments, we want to understand how the setting of the $\delta$ parameter affects the performance and the result precision. Figure 10 shows the effect of the $\delta$ setting on the $(k, \delta)$-range algorithm. Both datasets are two-dimensional data. As $\delta$ becomes larger, both the precision and the number of rounds needed to reach the $\delta$ condition decrease. Note that each round corresponds to one server-side range query. The choice of $\delta$ represents a tradeoff between the precision and the performance.




Fig. 10. Performance and result precision for different $\delta$ settings of the $(k, \delta)$-range algorithm for 2-dimensional data. (x-axis in both panels: delta for the (k, delta)-range.)

Uniform5D   21.12   0.27   0.007   4.19   0.01
Adult5D     16.28   0.39   0.007   1.9    0.01


Fig. 11. Precision reduction with more dimensions.

As we have discussed, the major weakness of the kNN-R algorithm is the precision reduction with increased dimensionality. When the dimensionality increases, the precision can significantly drop, which will increase the cost of post-processing in the client side. Figure 11 shows this phenomenon with the real Adult data and the simulated uniform data. However, compared to the overall cost, the client-side cost increase is still acceptable. We will show the comparison next.

Overall Costs. Many secure approaches cannot use indices for query processing, which results in poor performance. For example, the secure dot-product approach [33] encodes the points with random projections and recovers dot-products in query processing for distance comparison. This way of encoding data disallows index-based query processing. Without the aid of indices, processing a kNN query will have to scan the entire database, leaving many optimizations impossible to implement.

One concern with the kNN-R approach is the workload on the proxy server. Different from range queries, the proxy server will need to filter out the points returned by the server to find the final kNN. A reduced precision due to the increased dimensionality will imply an increased burden for the proxy server. We need to show how significant this proxy cost is.

Fig. 8. Performance comparison on Uniform data. Left: data size vs. cost of query; Middle: data dimensionality vs. cost of query; Right: query range (percentage of the domain) vs. cost of query

Fig. 9. Performance comparison on Adult data. Left: data size vs. cost of query; Middle: data dimensionality vs. cost of query; Right: query range (percentage of the domain) vs. cost of query

We use a database of 100 thousand data points and 1000 randomly selected queries for the 1NN experiment. The wall clock time (milliseconds) is used to show the average cost per query in Table 2. We also list the cost of the secure dot-product method [33] for comparison. Table 2 shows that the proxy server takes a negligible pre-processing cost and a very small post-processing cost, even for reduced precision in the 5D datasets. We use 5% domain length to extend the query point to form the initial higher bound. Compared to the dot-product method, the user-specified higher bound setting can cut off uninteresting regions, giving significant performance gain for sparse or skewed datasets, such as Adult5D. This cut-off effect cannot be implemented with the dot-product method. Furthermore, even for dense cases like the 2D datasets, the overall cost is only about half of the dot-product method.

Comparing kNN-R with the Casper Approach. In this set of experiments, we compare our approach and the Casper approach with a focus on the tradeoff between the data confidentiality and the query result precision (which indicates the workload of the in-house proxy). Based on the description in the paper [24], we implement the 1NN query processing algorithm for the experiment.

The Casper approach uses cloaking boxes to hide both the original data points in the database and the query points. It can also use the index to process kNN queries. The confidentiality of data in Casper is solely defined by the size of the cloaking box. Roughly speaking, the actual point has the same probability to be anywhere in the cloaking box. However, the size of the cloaking box also directly affects the precision of query results. Thus, the decision on the box size represents a tradeoff between the precision of query results and the data confidentiality.

TABLE 2

Per-query performance comparison (milliseconds) between linear scan on the original non-perturbed data and index-aided kNN-R processing on perturbed data.

Data & setting     Linear Scan   Pre-processing   Server Cost   Post-processing
Uniform2D/kNN-R    27.37         0.01             13.54         0.04
Adult2D/kNN-R      26.09         0.01             14.48         0.06
Uniform5D/kNN-R    33.03         0.01             13.79         0.34
Adult5D/kNN-R      31.96         0.01             2.56          0.05

For clear presentation, we assume each dimension
has the same domain length $h$ and each cloaking
box is a square with edge length $e$. Assume
the whole domain also has a uniform distribution.
According to the variance of the uniform distribution,
the NR_MSE measure is $\sqrt{6}e/(3h)$. To achieve the
protection of 10% of the domain length, we have $e \approx 0.12h$.

In Figure 12, the x-axis represents NR_MSE, i.e.,
Casper's relative cloaking-edge length. It shows
that when the edge length is increased from 2% to
10%, the precision drops dramatically from 62% to
13% for the 2D uniform data and from 43% to 10% for the
2D NE data, which shows the severe conflict between
precision and confidentiality. The kNN-R results are
also shown for comparison.

Fig. 12. The impact of cloaking-box size on precision
for Casper for the NE data.



7 Related work 

7.1 Protecting Outsourced Data 

Order Preserving Encryption. Order preserving encryption
(OPE) [1] preserves the dimensional value order
after encryption. It can be described as a function
$y = F(x)$ such that $\forall x_i, x_j$: $x_i < (>, =)\, x_j \Leftrightarrow y_i < (>, =)\, y_j$. A
well-known attack is based on attacker's prior knowl- 
edge on the original distributions of the attributes. 
If the attacker knows the original distributions and 
manages to identify the mapping between the original 
attribute and its encrypted counterpart, a bucket- 
based distribution alignment can be performed to 
break the encryption for the attribute 0. There are 
some applications of OPE in outsourced data process- 
ing. For example, Yiu et al. [2"T| uses a hierarchical 
space division method to encode spatial data points, 
which preserves the order of dimensional values and 
thus is one kind of OPE. 
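The distribution-based attack on OPE can be illustrated with a toy experiment. The sketch below is not a real OPE scheme and not the bucket-based procedure itself: `make_toy_ope` is a hypothetical strictly increasing random map, and `rank_alignment_attack` uses a simplified rank (quantile) alignment as a stand-in, assuming the attacker can sample from the original attribute distribution.

```python
import random

def make_toy_ope(domain_size, seed=0):
    """Return a strictly increasing random map on 0..domain_size-1.

    Hypothetical stand-in for an OPE function F; it is not a secure scheme.
    """
    rng = random.Random(seed)
    codes = sorted(rng.sample(range(domain_size * 100), domain_size))
    return lambda x: codes[x]

def rank_alignment_attack(ciphertexts, known_sample):
    """Guess plaintexts by aligning ciphertext ranks with the quantiles of a
    sample drawn from the (assumed known) original distribution."""
    ref = sorted(known_sample)
    order = sorted(range(len(ciphertexts)), key=lambda i: ciphertexts[i])
    guesses = [0] * len(ciphertexts)
    for rank, idx in enumerate(order):
        guesses[idx] = ref[rank * len(ref) // len(ciphertexts)]
    return guesses

rng = random.Random(1)
F = make_toy_ope(1000)
plain = [rng.randrange(1000) for _ in range(2000)]   # owner's attribute values
cipher = [F(x) for x in plain]                        # values seen by the server
prior = [rng.randrange(1000) for _ in range(2000)]    # attacker's prior sample

guesses = rank_alignment_attack(cipher, prior)
mean_err = sum(abs(g - x) for g, x in zip(guesses, plain)) / len(plain)
print(f"mean absolute estimation error: {mean_err:.1f} (domain length 1000)")
```

Because the order is preserved, matching ranks is enough to estimate each value closely once the distribution is known, which is the weakness the attack exploits.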

Crypto-Index. Crypto-Index is also based on
column-wise bucketization. It assigns a random ID
to each bucket; the values in the bucket are replaced
with the bucket ID to generate the auxiliary data for
indexing. To utilize the index for query processing, a
normal range query condition has to be transformed
to a set-based query on the bucket IDs. For example,
$X_i < a_i$ might be replaced with $X'_i \in \{ID_1, ID_2, ID_3\}$.
A bucket-diffusion scheme [14] was proposed to protect
the access pattern, which, however, has to sacrifice
the precision of query results, and thus increases the
client's cost of filtering the query result.

Distance-Recoverable Encryption. DRE is the most
intuitive method for preserving the nearest neighbor
relationship. Because of the exactly preserved distances,
many attacks can be applied [33], [20], [8].
Wong et al. [33] suggest preserving dot products
instead of distances to find kNN, which is more
resilient to distance-targeted attacks. One drawback
is that the search algorithm is limited to linear scan and
no indexing method can be applied.



7.2 Preserving Query Privacy 

Private information retrieval (PIR) [9] tries to fully 
preserve the privacy of access pattern, while the data 
may not be encrypted. PIR schemes are normally very 
costly. Focusing on the efficiency side of PIR, Williams 
et al. [32] use a pyramid hash index to implement efficient
privacy-preserving data-block operations based
on the idea of Oblivious RAM. It is different from our 
setting of high throughput range query processing. 

Hu et al. [15] address the query privacy problem
and require the authorized query users, the data
owner, and the cloud to collaboratively process kNN
queries. However, most computing tasks are done
in the user's local system with heavy interactions
with the cloud server. The cloud server only aids
query processing, which does not meet the principle
of moving computing to the cloud.

Papadopoulos et al. [26] use private information
retrieval methods [9] to enhance location privacy.
However, their approach does not consider protecting
the confidentiality of data. SpaceTwist [35] proposes a
method to query kNN by providing a fake user location
for preserving location privacy, but the method
does not consider data confidentiality either. The
Casper approach [24] considers both data confidentiality
and query privacy, the details of which have been
discussed in our experiments.

7.3 Other Related Work 

Another line of research [29] enables authorized
users to access only the authorized portion of data,
e.g., a certain range, with a public key scheme. However,
the underlying encryption schemes do not produce
indexable encrypted data. The setting of multidimensional
range query in [29] is different from ours.
Their approach requires that the data owner provide
the indices and keys for the server, and authorized
users use the data in the server. In the cloud
database scenario, by contrast, the cloud server takes more
responsibility for indexing and query processing. Secure
keyword search on encrypted documents [10], [31], [5]
scans each encrypted document in the database and
finds the documents containing the keyword, which
is more like point search in a database. The research on
privacy-preserving data mining has discussed multiplicative
perturbation methods [7], which are similar
to the RASP encryption, but with more emphasis on
preserving the utility for data mining.

8 Conclusion 

We propose the RASP perturbation approach to host- 
ing query services in the cloud, which satisfies the 
CPEL criteria: data Confidentiality, query Privacy, 
Efficient query processing, and Low in-house work- 
load. The requirement on low in-house workload is a 
critical feature to fully realize the benefits of cloud
computing, and efficient query processing is a key
measure of the quality of query services.

RASP perturbation is a unique composition of OPE, 
dimensionality expansion, random noise injection, 
and random projection, which provides unique security
features. It aims to preserve the topology of
the queried range in the perturbed space, and allows
indices to be used for efficient range query processing.
With the topology-preserving features, we are able to 
develop efficient range query services to achieve sub- 
linear time complexity of processing queries. We then 
develop the kNN query service based on the range 
query service. The security of both the perturbed data 
and the protected queries is carefully analyzed under 
a precisely defined threat model. We also conduct 
several sets of experiments to show the efficiency 
of query processing and the low cost of in-house 
processing. 

We will continue our studies on two aspects: (1)
further improving the performance of query processing
for both range queries and kNN queries; (2) formally
analyzing the leaked query and access patterns and their
possible effect on both data and query confidentiality.

9 Appendix

9.1 Proofs. 

Proving that RASP is not OPE.

Let $y = (E_{ope}(x)^T, 1, v)^T$; we only need to
prove that $F(y) = Ay$ does not preserve the dimensional
value order. Let $f^i$ be the selection vector
$(0, \ldots, 1, \ldots, 0)$, i.e., only the $i$-th dimension is 1 and
the other dimensions are 0. Then, $(f^i)^T y$ returns the
value at dimension $i$ of $y$.

Proof: Let $A$ be an invertible matrix with at least
two non-zero entries in each row. For any vector $y$,
let $y' = Ay$. For any two vectors $s$ and $t$, using the
dimensional selection vector we have $s'_i = (f^i)^T A s$
and $t'_i = (f^i)^T A t$. If the dimensional order is preserved,
we will have $(s_i - t_i)(s'_i - t'_i) > 0$. However,

$$(s_i - t_i)(s'_i - t'_i) = (s_i - t_i)(f^i)^T A (s - t) = (s_i - t_i) \sum_{j=1}^{k} a_{i,j}(s_j - t_j), \quad (5)$$

where $a_{i,j}$ is the $i$-th row, $j$-th column element of $A$.
Without loss of generality, assume $s_i > t_i$ (for
$s_i < t_i$ the same proof applies). It is straightforward
to see that the sign of $(s_i - t_i)(s'_i - t'_i)$ is subject to
the values $s_j$ and $t_j$ in the other dimensions $j \neq i$. As a
result, RASP does not preserve the dimensional order.

□
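A small numeric instance makes the argument concrete. The matrix and vectors below are hypothetical values chosen only for illustration; the check follows the product $(s_i - t_i)(s'_i - t'_i)$ used in the proof, with $i$ being the first dimension.

```python
def matvec(A, x):
    """Plain matrix-vector product for nested lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical invertible matrix (det = 3) with two non-zero entries per row.
A = [[1.0, -2.0],
     [1.0,  1.0]]

def order_product(s, t, i=0):
    """(s_i - t_i) * (s'_i - t'_i), where s' = A s and t' = A t."""
    s_p, t_p = matvec(A, s), matvec(A, t)
    return (s[i] - t[i]) * (s_p[i] - t_p[i])

# Same first-dimension order (s_1 > t_1) in both cases; only dimension 2 changes.
flipped = order_product([2.0, 5.0], [1.0, 0.0])    # 1 * (1*1 + (-2)*5) = -9
preserved = order_product([2.0, 5.0], [1.0, 5.0])  # 1 * (1*1 + (-2)*0) = 1
print(flipped, preserved)   # -9.0 1.0
```

The sign of the product changes although $s_1 > t_1$ holds in both cases, so the order in dimension 1 depends on the values in the other dimension.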

Proving that $MBR^{(MID)}$ encloses $MBR^{(mid)}$.

Proof: In general, the MBR of an arbitrary polyhedron
can be derived from the vertices of the
polyhedron. By the convexity-preserving property
of RASP, a polyhedron is mapped to another
polyhedron in the encrypted space. Concretely, let
a polyhedron $P$ have $m$ vertices $\{x_1, \ldots, x_m\}$, which
are mapped to the vertices $\{y_1, \ldots, y_m\}$ in the encrypted
space. Then, the upper bound and lower bound
of dimension $j$ of the MBR of the polyhedron in
the encrypted space are determined by $\max\{y_{ij}, i = 1 \ldots m\}$
and $\min\{y_{ij}, i = 1 \ldots m\}$, respectively.

Since we only use the MBR to reduce the set of
results for filtering, a slightly larger MBR would
still guarantee the correctness of the MBR-based
query processing algorithm, with possibly increased
filtering cost. In the following, we try to find such
an MBR to enclose $MBR^{(mid)}$. By the definition of
the square ranges $S^{(low)}$, $S^{(mid)}$, and $S^{(high)}$, their
vertices have the relationship $x_i^{(mid)} = (x_i^{(low)} + x_i^{(high)})/2$.
Correspondingly, the $MBR^{(mid)}$ in the perturbed space
is determined by the vertices $y_i^{(mid)} = (y_i^{(low)} + y_i^{(high)})/2$.

Let the $j$-th dimension of $MBR^{(L)}$, $L \in \{low, mid, high\}$, be represented as
$[s^{(L)}_{j,min}, s^{(L)}_{j,max}]$, where $s^{(L)}_{j,min} = \min\{y^{(L)}_{ij}, i = 1 \ldots m\}$
and $s^{(L)}_{j,max} = \max\{y^{(L)}_{ij}, i = 1 \ldots m\}$. Now we
choose the $MBR^{(MID)}$ as follows: for the $j$-th dimension, we use the interval
$[(s^{(low)}_{j,min} + s^{(high)}_{j,min})/2, \; (s^{(low)}_{j,max} + s^{(high)}_{j,max})/2]$.
We show that this interval encloses $[s^{(mid)}_{j,min}, s^{(mid)}_{j,max}]$.

For two sets of $m$ real values $\{a_1, \ldots, a_m\}$ and
$\{b_1, \ldots, b_m\}$, it is easy to verify that

$$\max\{a_1, \ldots, a_m\} + \max\{b_1, \ldots, b_m\} \geq \max\{a_1 + b_1, \ldots, a_m + b_m\}, \quad (6)$$

$$\min\{a_1, \ldots, a_m\} + \min\{b_1, \ldots, b_m\} \leq \min\{a_1 + b_1, \ldots, a_m + b_m\}. \quad (7)$$

Applying (6) and (7) with $a_i = y^{(low)}_{ij}$ and $b_i = y^{(high)}_{ij}$, $i = 1 \ldots m$, gives
$(s^{(low)}_{j,min} + s^{(high)}_{j,min})/2 \leq s^{(mid)}_{j,min}$ and
$s^{(mid)}_{j,max} \leq (s^{(low)}_{j,max} + s^{(high)}_{j,max})/2$.
Since for each dimension $MBR^{(MID)}$ encloses
$MBR^{(mid)}$, we have that $MBR^{(MID)}$ encloses $MBR^{(mid)}$. □
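Inequalities (6) and (7) and the resulting enclosure are easy to check numerically. In the sketch below, `a` and `b` play the role of the $j$-th coordinates of corresponding vertices of two ranges, and the per-vertex midpoint `(a_i + b_i) / 2` is an assumption of this sketch about how the middle range relates to the other two. The function name and the random test values are hypothetical.

```python
import random

def check_enclosure(trials=1000, m=8, seed=0):
    """Check (6), (7) and the derived interval enclosure on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(-10, 10) for _ in range(m)]   # coordinates of one vertex set
        b = [rng.uniform(-10, 10) for _ in range(m)]   # coordinates of the other set
        sums = [x + y for x, y in zip(a, b)]
        if not max(a) + max(b) >= max(sums):           # inequality (6)
            return False
        if not min(a) + min(b) <= min(sums):           # inequality (7)
            return False
        mid = [s / 2 for s in sums]                    # per-vertex midpoints
        lower = (min(a) + min(b)) / 2                  # chosen lower bound
        upper = (max(a) + max(b)) / 2                  # chosen upper bound
        if not (lower <= min(mid) and max(mid) <= upper):
            return False
    return True

print(check_enclosure())   # True
```

The chosen interval is usually strictly larger than the exact one, because the maxima of `a` and `b` are generally attained at different vertices; this is the extra filtering cost mentioned in the proof.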



9.2 Algorithms 



Algorithm 1 RASP Data Perturbation

RASP_Perturb(X, RNG, RIMG, K_ope)
Input: X: k x n data records, RNG: random real value
generator that draws values from the standard normal
distribution, RIMG: random invertible matrix generator,
K_ope: key for OPE E_ope; Output: the matrix A

A ← 0;
A_3 ← the last column of A;
v_0 ← 4;
while A_3 contains zero do
    generate A with RIMG;
end while
for each record x in X do
    v ← v_0 - 1;
    while v < v_0 do
        v ← RNG;
    end while
    y ← A((E_ope(x, K_ope))^T, 1, v)^T;
    submit y to the server;
end for
return A;
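A minimal Python sketch of the perturbation procedure is given below, under several assumptions that are not part of the algorithm above: `toy_ope` is a hypothetical monotone placeholder rather than a secure OPE function, `random_invertible_matrix` is one simple way to stand in for RIMG, and records are passed as tuples, one record at a time.

```python
import random

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= f * M[c][k]
    return d

def random_invertible_matrix(n, rng):
    """Stand-in for RIMG: redraw until A is invertible and its last column has no zero."""
    while True:
        A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        if abs(det(A)) > 1e-6 and all(row[-1] != 0.0 for row in A):
            return A

def rasp_perturb(records, e_ope, rng, v0=4.0):
    """Return (A, perturbed records) with y = A (E_ope(x)^T, 1, v)^T."""
    d = len(records[0])
    A = random_invertible_matrix(d + 2, rng)
    perturbed = []
    for x in records:
        v = v0 - 1.0
        while v < v0:                    # redraw the noise until v >= v0
            v = rng.gauss(0.0, 1.0)      # slow: P(v >= 4) is about 3e-5 per draw
        z = [e_ope(xi) for xi in x] + [1.0, v]
        perturbed.append([sum(a * zj for a, zj in zip(row, z)) for row in A])
    return A, perturbed

rng = random.Random(42)
toy_ope = lambda x: 3.0 * x + 7.0        # hypothetical placeholder, not a secure OPE
A, ys = rasp_perturb([(1.0, 2.0), (5.0, 0.5)], toy_ope, rng)
print(len(ys), len(ys[0]))   # 2 4
```

Each d-dimensional record becomes a (d + 2)-dimensional vector. The rejection loop follows the pseudocode literally; an implementation concerned with speed would likely sample the tail of the distribution directly.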



Algorithm 2 encodes a normal range query and generates
the $Q_i$ matrices and the MBR for the transformed
query.

In Algorithm 3, the two-stage query processing uses
the MBR to find the initial query result and then filters
the result with the transformed query conditions
$y^T Q_i y \leq 0$, where the matrices $\{Q_i\}$ and the MBR are
passed by the client and $y$ is each perturbed record.
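The two stages can be sketched as follows. This is a toy setting, not the server implementation: a linear scan stands in for the index tree, the perturbation matrix is taken to be the identity so that $y = (x_1, x_2, 1, v)$ and $Q = u w^T$, the MBR is made deliberately loose so that the second stage has something to filter, and the choice $w = (0, 0, -v_0, 1)$ is an assumption of this sketch.

```python
def quad_form(y, Q):
    """Evaluate y^T Q y for a square matrix Q given as nested lists."""
    n = len(y)
    return sum(y[p] * Q[p][q] * y[q] for p in range(n) for q in range(n))

def in_mbr(y, mbr):
    """True if y lies inside the axis-aligned box given as (low, high) pairs."""
    return all(lo <= yi <= hi for yi, (lo, hi) in zip(y, mbr))

def process_query(records, mbr, Qs):
    # Stage 1: MBR lookup (a linear scan stands in for the index tree).
    candidates = [y for y in records if in_mbr(y, mbr)]
    # Stage 2: keep a record only if every quadratic condition holds.
    return [y for y in candidates if all(quad_form(y, Q) <= 0 for Q in Qs)]

def outer(u, w):
    """Outer product u w^T."""
    return [[ui * wj for wj in w] for ui in u]

v0 = 4.0
w = [0.0, 0.0, -v0, 1.0]
Qs = [outer(u, w) for u in (
    [-1.0, 0.0, 2.0, 0.0],    # x1 > 2
    [1.0, 0.0, -6.0, 0.0],    # x1 < 6
    [0.0, -1.0, 1.0, 0.0],    # x2 > 1
    [0.0, 1.0, -5.0, 0.0],    # x2 < 5
)]
mbr = [(1.0, 7.0), (0.0, 6.0), (1.0, 1.0), (v0, float("inf"))]
records = [[3.0, 3.0, 1.0, 4.5], [6.5, 3.0, 1.0, 5.0], [9.0, 9.0, 1.0, 4.2]]
result = process_query(records, mbr, Qs)
print(result)   # [[3.0, 3.0, 1.0, 4.5]]
```

The third record is rejected by the MBR alone, while the second passes the MBR but fails the condition for $x_1 < 6$ in the filtering stage.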

The following Algorithm 4 describes the details of
the (K, δ)-Range algorithm for determining the inner
range.



Algorithm 2 RASP Secure Query Transformation.

QuadraticQuery(Cond, A)
Input: Cond: 2d simple conditions for d-dimensional
data, 2 conditions for each dimension. A: the perturbation
matrix. Output: the MBR of the transformed range
and the quadratic query matrices Q_i, i = 1 ... 2d.

v_0 ← 4;
for each condition C_i in Cond do
    u ← zeros(d + 2, 1);
    if C_i is like X_j < a_j then
        u_j ← 1, u_{d+1} ← -a_j;
    end if
    if C_i is like X_j > a_j then
        u_j ← -1, u_{d+1} ← a_j;
    end if
    w ← zeros(d + 2, 1);
    w_{d+2} ← 1;
    w_{d+1} ← -v_0;
    Q_i ← (A^{-1})^T u w^T A^{-1};
end for
Use the vertex transformation method to find the MBR
of the transformed queries;
return MBR and {Q_i, i = 1 ... 2d};
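The construction of one query matrix can be sketched in plain Python. Several points are assumptions of this sketch rather than statements about the original implementation: the OPE step is taken to be the identity, the matrix `A` is a hypothetical diagonally dominant (hence invertible) example, indices are 0-based, and `w` is set so that $w^T z = v - v_0$. The MBR computation is omitted.

```python
def inverse(A):
    """Matrix inverse by Gauss-Jordan elimination (A is assumed invertible)."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def query_matrix(A_inv, d, j, a, less_than, v0=4.0):
    """Q = (A^-1)^T u w^T A^-1 for the condition X_j < a (or X_j > a)."""
    n = d + 2
    u, w = [0.0] * n, [0.0] * n
    u[j], u[d] = (1.0, -a) if less_than else (-1.0, a)
    w[d], w[d + 1] = -v0, 1.0
    left = [sum(A_inv[r][c] * u[r] for r in range(n)) for c in range(n)]    # (A^-1)^T u
    right = [sum(w[r] * A_inv[r][c] for r in range(n)) for c in range(n)]   # w^T A^-1
    return [[left[p] * right[q] for q in range(n)] for p in range(n)]

def quad_form(y, Q):
    """Evaluate y^T Q y."""
    n = len(y)
    return sum(y[p] * Q[p][q] * y[q] for p in range(n) for q in range(n))

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

A = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 0.0, 1.0],
     [0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
Q = query_matrix(inverse(A), d=2, j=0, a=5.0, less_than=True)   # condition X_1 < 5

inside = matvec(A, [3.0, 8.0, 1.0, 4.5])    # x = (3, 8), noise v = 4.5
outside = matvec(A, [7.0, 8.0, 1.0, 4.5])   # x = (7, 8)
print(quad_form(inside, Q) <= 0, quad_form(outside, Q) <= 0)   # True False
```

For a perturbed record $y = Az$, the quadratic form equals $(u^T z)(w^T z) = (x_j - a)(v - v_0)$, so its sign reveals whether the condition holds without exposing $u$ or $A$ directly.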



Algorithm 3 Two-Stage Query Processing.

ProcessQuery(MBR, {Q_i})
Input: MBR: MBR for the transformed query;
{Q_i}: filtering conditions; Output: the set of perturbed
records satisfying the conditions.

Y ← use the indexing tree to find answers for MBR;
Y' ← ∅;
for each record y in Y do
    success ← 1;
    for each condition Q_i do
        if y^T Q_i y > 0 then
            success ← 0;
            break;
        end if
    end for
    if success = 1 then
        add y into Y';
    end if
end for
return Y' to the client;

Algorithm 4 (K, δ)-Range Algorithm

procedure (K, δ)-Range(L_1, L_m, k, δ)
    high ← L_m, low ← L_1;
    while high - low > ε do
        mid ← (high + low)/2;
        num ← number of points in S^(mid);
        if num ≥ k && num ≤ k + δ then
            Break the loop;
        else if num > k + δ then
            high ← mid;
        else
            low ← mid;
        end if
    end while
    return S^(mid);
end procedure
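The binary search over the range extent can be sketched as follows. This is a plain-data illustration with hypothetical names: `count_in_square` counts points directly, whereas in the actual service the count would come from a range query on the perturbed data. The sketch also assumes that the first two arguments are the initial lower and upper extents of a square range centred at the query point, and that the loop stops at a small tolerance `eps`.

```python
def count_in_square(points, center, extent):
    """Number of points inside the square range centred at `center`
    whose half-width is `extent`."""
    return sum(all(abs(p - c) <= extent for p, c in zip(pt, center))
               for pt in points)

def k_delta_range(points, center, low, high, k, delta, eps=1e-6):
    """Binary search for an extent whose square holds between k and k + delta points."""
    mid = (high + low) / 2
    while high - low > eps:
        mid = (high + low) / 2
        num = count_in_square(points, center, mid)
        if k <= num <= k + delta:
            break                 # inner range found
        elif num > k + delta:
            high = mid            # too many points: shrink the range
        else:
            low = mid             # too few points: enlarge the range
    return mid

points = [(float(i), float(i)) for i in range(10)]
extent = k_delta_range(points, (0.0, 0.0), low=0.0, high=8.0, k=3, delta=1)
print(extent, count_in_square(points, (0.0, 0.0), extent))   # 2.0 3
```

In this example the first probe (extent 4) contains five points, which exceeds k + δ = 4, so the upper bound shrinks; the second probe (extent 2) contains three points and the search stops.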



